Setting the base address for a 64-bit BAR requires two separate 32-bit
writes to configuration space, and so will necessarily result in the
BAR temporarily holding an invalid partially written address.
Some hypervisors (observed on an AWS EC2 c7a.medium instance in
eu-west-2) will assume that guests will write BAR values only while
decoding is disabled, and may not rebuild MMIO mappings for the guest
if the BAR registers are written while decoding is enabled. The
effect of this is that MMIO accesses are not routed through to the
device even though inspection from within the guest shows that every
single PCI configuration register has the correct value. Writes to
the device will be ignored, and reads will return the all-ones pattern
that typically indicates a nonexistent device.
With the ENA network driver now using low latency transmit queues,
this results in the transmit descriptors being lost (since the MMIO
writes to BAR2 never reach the device), which in turn causes the
device to lock up as soon as the transmit doorbell is rung for the
first time.
Fix by disabling decoding of memory and I/O cycles while setting a BAR
address (as we already do while sizing a BAR), so that the invalid
partial address can never be decoded and so that hypervisors will
rebuild MMIO mappings as expected.
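The sequence can be sketched as follows. This is a minimal illustration against a simulated configuration space, not the actual implementation: the structure and the cfg_*() helpers are hypothetical stand-ins for real configuration space accesses, and only the register offsets and command bits come from the PCI specification.

```c
#include <stdint.h>
#include <string.h>

/* Standard PCI configuration space offsets and command register bits */
#define PCI_COMMAND	0x04
#define PCI_COMMAND_IO	0x0001
#define PCI_COMMAND_MEM	0x0002
#define PCI_BAR_0	0x10

/* Simulated configuration space (hypothetical, for illustration only) */
struct fake_pci {
	uint8_t cfg[256];
};

uint16_t cfg_read16 ( struct fake_pci *pci, unsigned int offset ) {
	uint16_t value;
	memcpy ( &value, &pci->cfg[offset], sizeof ( value ) );
	return value;
}

uint32_t cfg_read32 ( struct fake_pci *pci, unsigned int offset ) {
	uint32_t value;
	memcpy ( &value, &pci->cfg[offset], sizeof ( value ) );
	return value;
}

void cfg_write16 ( struct fake_pci *pci, unsigned int offset,
		   uint16_t value ) {
	memcpy ( &pci->cfg[offset], &value, sizeof ( value ) );
}

void cfg_write32 ( struct fake_pci *pci, unsigned int offset,
		   uint32_t value ) {
	memcpy ( &pci->cfg[offset], &value, sizeof ( value ) );
}

/* Set a 64-bit BAR while memory and I/O decoding are disabled */
void bar_set64 ( struct fake_pci *pci, unsigned int reg, uint64_t addr ) {
	uint16_t cmd = cfg_read16 ( pci, PCI_COMMAND );

	/* Disable decoding so the half-written address is never decoded */
	cfg_write16 ( pci, PCI_COMMAND,
		      ( cmd & ~( PCI_COMMAND_IO | PCI_COMMAND_MEM ) ) );

	/* Two separate 32-bit writes: low dword first, then high dword.
	 * (Real hardware ignores writes to the read-only flag bits.)
	 */
	cfg_write32 ( pci, reg, ( uint32_t ) addr );
	cfg_write32 ( pci, ( reg + 4 ), ( uint32_t ) ( addr >> 32 ) );

	/* Restore the original command register value */
	cfg_write16 ( pci, PCI_COMMAND, cmd );
}
```

The point of the ordering is that both halves of the address are in place before decoding is re-enabled by the final write to the command register.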
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Experiments suggest that the instance type is exposed via the SMBIOS
product name. Include this information within the default output,
since it is often helpful in debugging.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The queue base address is meaningless for a low latency queue, since
the queue entries are written directly to the on-device memory. Any
non-zero queue base address will be safely ignored by the hardware,
but leaves open the possibility that future revisions could treat it
as an error.
Leave this field as zero, to match the behaviour of the Linux driver.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On some newer (7th and 8th generation) instance types, the 32-bit
build of iPXE cannot access PCI configuration space since the ECAM is
placed outside of the 32-bit address space. The visible symptom is
that iPXE fails to detect any network devices.
The public AMIs are all now built as 64-bit binaries, but there is
nothing that prevents the building and importing of a 32-bit AMI.
There are still potentially valid use cases for 32-bit AMIs (e.g. if
planning to use the AMI only for older instance types), and so we
cannot sensibly prevent this error at build time.
Display the build architecture as part of the AWS EC2 embedded script,
to at least allow for easy identification of this particular failure
mode at run time.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Making images public is blocked by default in new AWS regions. Remove
this block automatically whenever creating a public image.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit a801244 ("[ena] Increase receive ring size to 128 entries")
increased the receive ring size to 128 entries (while leaving the fill
level at 16), since using a smaller receive ring caused unexplained
failures on some instance types.
The original hardware bug that resulted in that commit seems to have
been fixed: experiments suggest that the original failure (observed on
a c6i.large instance in eu-west-2) will no longer reproduce when using
a receive ring containing only 16 entries (as was the case prior to
that commit).
Newer generations of the ENA hardware (observed on an m8i.large
instance in eu-south-2) seem to have a new and exciting hardware bug:
these instance types appear to use a hash of the received packet
header to determine which portion of the (out-of-order) receive ring
to use. If that portion of the ring happens to be empty (e.g. because
only 32 entries of the 128-entry ring are filled at any one time),
then the packet will be silently dropped.
Work around this new hardware bug by reducing the receive ring size
down to the current fill level of 32 entries. This appears to work on
all current instance types (but has not been exhaustively tested).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Avoid running out of transmit descriptors when sending TCP ACKs by
increasing the transmit queue size to match the increased receive
fill level.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Experiments suggest that at least some instance types (observed with
c6i.large in eu-west-2) experience high packet drop rates with only 16
receive buffers allocated. Increase the fill level to 32 buffers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Newer generations of the ENA hardware require the use of low latency
transmit queues, where the submission queues and the initial portion
of the transmitted packet are written to on-device memory via BAR2
instead of being read from host memory.
Detect support for low latency queues and set the placement policy
appropriately. We attempt the use of low latency queues only if the
device reports that it supports inline headers, 128-byte entries, and
two descriptors prior to the inlined header, on the basis that we
don't care about using low latency queues on older versions of the
hardware since those versions will support normal host memory
submission queues anyway.
We reuse the redundant memory allocated for the submission queue as
the bounce buffer for constructing the descriptors and inlined packet
data, since this avoids needing a separate allocation just for the
bounce buffer.
We construct a metadata submission queue entry prior to the actual
submission queue entry, since experimentation suggests that newer
generations of the hardware require this to be present even though it
conveys no information beyond its own existence.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Avoid spurious assertion failures by ensuring that references to
uncompleted transmit buffers are not retained after the device has
been closed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Newer generations of the ENA hardware require the use of low latency
transmit queues, where the submission queues and the initial portion
of the transmitted packet are written to on-device memory via BAR2
instead of being read from host memory.
Prepare for this by mapping the on-device memory BAR. As with the
register BAR, we may need to steal a base address from the upstream
PCI bridge since the BIOS on some instance types (observed with an
m8i.metal-48xl instance in eu-south-2) will fail to assign an address
to the device.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use pci_bar_set() when we need to set a device base address (on
instance types such as c6i.metal where the BIOS fails to do so), so
that 64-bit BARs will be handled automatically.
This particular issue has so far been observed only on 6th generation
instances. These use 32-bit BARs, and so the lack of support for
handling 64-bit BARs has not caused any observable issue.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide pci_bar_set() to handle setting the base address for a
potentially 64-bit BAR, and rewrite pci_bar_size() to correctly handle
sizing of 64-bit BARs.
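As a rough sketch of the sizing arithmetic (not the actual pci_bar_size() code): after writing all-ones to the BAR register(s), the values read back give a mask of the writable address bits, and the size is the two's complement of that mask. The function name and the convention of passing the two read-back dwords are assumptions made for illustration.

```c
#include <stdint.h>

/* Standard PCI BAR flag bits */
#define PCI_BAR_IO		0x00000001	/* I/O (not memory) BAR */
#define PCI_BAR_MEM_TYPE	0x00000006	/* Memory type field */
#define PCI_BAR_MEM_64		0x00000004	/* 64-bit memory BAR */

/* Calculate BAR size from the values read back after writing
 * all-ones to the low and high BAR registers.  The high value is
 * ignored for a 32-bit BAR.
 */
uint64_t bar_size ( uint32_t low, uint32_t high ) {
	uint32_t flags = ( ( low & PCI_BAR_IO ) ? 0x3 : 0xf );
	int is64 = ( ( low & ( PCI_BAR_IO | PCI_BAR_MEM_TYPE ) ) ==
		     PCI_BAR_MEM_64 );
	uint64_t mask = ( low & ~flags );

	/* Include the upper dword only for a 64-bit BAR */
	if ( is64 )
		mask |= ( ( ( uint64_t ) high ) << 32 );

	/* No writable address bits: BAR is not implemented */
	if ( ! mask )
		return 0;

	/* Treat the upper dword as all-ones for a 32-bit BAR */
	if ( ! is64 )
		mask |= 0xffffffff00000000ULL;

	return ( ~mask + 1 );
}
```

For example, a 64-bit prefetchable BAR reading back as 0xfff0000c (low) and 0xffffffff (high) has a size of 1MB, while one reading back as 0x0000000c and 0xfffffffe has a size of 8GB, which could not be represented if only the low dword were considered.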
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RFC 7627 states that renegotiation becomes no longer secure under
various circumstances when the non-extended master secret is used.
The description of the precise set of circumstances is spread across
various points within the document and is not entirely clear.
Avoid a superset of the circumstances in which renegotiation
apparently becomes insecure by refusing renegotiation completely
unless the extended master secret is used.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RFC 7627 section 5.3 states that the client must abort the handshake
if the server attempts to resume a session where the master secret
calculation method stored in the session does not match the method
used for the connection being resumed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RFC 7627 defines the Extended Master Secret (EMS) as an alternative
calculation that uses the digest of all handshake messages rather than
just the client and server random bytes.
Add support for negotiating the Extended Master Secret extension and
performing the relevant calculation of the master secret.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The calculation for the extended master secret as defined in RFC 7627
relies upon the digest of all handshake messages up to and including
the Client Key Exchange.
Facilitate this calculation by generating the master secret only after
sending the Client Key Exchange message.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Experimentation suggests that rearming the interrupt once per observed
completion is not sufficient: we still see occasional delays during
which the hardware fails to write out completions.
As described in commit d2e1e59 ("[gve] Use dummy interrupt to trigger
completion writeback in DQO mode"), there is no documentation around
the precise semantics of the interrupt rearming mechanism, and so
experimentation is the only available guide. Switch to rearming both
TX and RX interrupts unconditionally on every poll, since this
produces better experimental results.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The DQO-QPL operating mode uses registered queue page lists but still
requires the raw DMA address (rather than the linear offset within the
QPL) to be provided in transmit and receive descriptors.
Set the queue page list base device address appropriately.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The hardware reports descriptor and packet completions separately for
the transmit ring. We currently ignore descriptor completions (since
we cannot free up the transmit buffers in the queue page list and
advance the consumer counter until the packet has also completed).
Now that transmit completions are written out immediately (instead of
being delayed until 128 bytes of completions are available), there is
no value in retaining the descriptor completions.
Omit descriptor completions entirely, and reduce the transmit fill
level back down to its original value.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When operating in the DQO operating mode, the device will defer
writing transmit and receive completions until an entire internal
cacheline (128 bytes) is full, or until an associated interrupt is
asserted. Since each receive descriptor is 32 bytes, this will cause
received packets to be effectively delayed until up to three further
packets have arrived. When network traffic volumes are very low (such
as during DHCP, DNS lookups, or TCP handshakes), this typically
induces delays of up to 30 seconds and results in a very poor user
experience.
Work around this hardware problem in the same way as for the Intel
40GbE and 100GbE NICs: by enabling dummy MSI-X interrupts to trick the
hardware into believing that it needs to write out completions to host
memory.
There is no documentation around the interrupt rearming mechanism.
The value written to the interrupt doorbell does not include a
consumer counter value, and so must be relying on some undocumented
ordering constraints. Comments in the Linux driver source suggest
that the authors believe that the device will automatically and
atomically mask an MSI-X interrupt at the point of asserting it, that
any further interrupts arriving before the doorbell is written will be
recorded in the pending bit array, and that writing the doorbell will
therefore immediately assert a new interrupt if needed.
In the absence of any documentation, choose to rearm the interrupt
once per observed completion. This is overkill, but is less impactful
than the alternative of rearming the interrupt unconditionally on
every poll.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Ensure that the remainder of each completion record is read only after
verifying the generation bit (or sequence number).
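The required ordering can be sketched as below. The structure layout and names are hypothetical, and a C11 acquire fence is used here as a portable stand-in for whatever read barrier the driver actually uses.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical completion record written by the device */
struct completion_sketch {
	uint32_t data;		/* Payload (valid only once gen matches) */
	uint8_t gen;		/* Generation bit, written last by device */
};

/* Read a completion: returns 1 and fills in data if the record is
 * valid for the expected generation, or 0 if it is not yet valid.
 */
int completion_read ( const volatile struct completion_sketch *cmplt,
		      uint8_t expected_gen, uint32_t *data ) {

	/* Check the generation bit first */
	if ( cmplt->gen != expected_gen )
		return 0;

	/* Ensure that the payload is read only after the check */
	atomic_thread_fence ( memory_order_acquire );

	*data = cmplt->data;
	return 1;
}
```

Without the barrier, the payload read could in principle be satisfied before the generation check, returning stale data from the previous pass around the ring.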
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use the default dummy MSI-X target address that is now allocated and
configured automatically by pci_msix_enable().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Interrupts as such are not used in iPXE, which operates in polling
mode. However, some network cards (such as the Intel 40GbE and 100GbE
NICs) will defer writing out completions until the point of asserting
an MSI-X interrupt.
From the point of view of the PCI device, asserting an MSI-X interrupt
is just a 32-bit DMA write of an opaque value to an opaque target
address. The PCI device has no way to know whether or not the target
address corresponds to a real APIC.
We can therefore trick the PCI device into believing that it is
asserting an MSI-X interrupt, by configuring it to write an opaque
32-bit value to a dummy target address in host memory. This is
sufficient to trigger the associated write of the completions to host
memory.
Allocate a dummy target address when enabling MSI-X on a PCI device,
and map all interrupts to this target address by default.
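The idea can be sketched as follows, assuming the standard 16-byte MSI-X table entry layout. The function name and parameters are hypothetical, and in a real driver the target would need to be a DMA-mapped address rather than an arbitrary value.

```c
#include <stdint.h>

/* One MSI-X table entry, as laid out by the PCI specification */
struct msix_entry {
	uint32_t addr_lo;	/* Message address (low 32 bits) */
	uint32_t addr_hi;	/* Message address (high 32 bits) */
	uint32_t data;		/* Message data */
	uint32_t ctrl;		/* Vector control (bit 0 = masked) */
};

/* Point every interrupt vector at a single dummy target address.
 * The mask bits are deliberately left untouched.
 */
void msix_map_all ( volatile struct msix_entry *table,
		    unsigned int count, uint64_t target,
		    uint32_t data ) {
	unsigned int i;

	for ( i = 0 ; i < count ; i++ ) {
		table[i].addr_lo = ( ( uint32_t ) target );
		table[i].addr_hi = ( ( uint32_t ) ( target >> 32 ) );
		table[i].data = data;
	}
}
```

From the device's point of view, each "interrupt" is then just a 32-bit write of the data value to the dummy target, which is never examined by the host.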
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Select a preferred operating mode from those advertised as supported
by the device, falling back to the oldest known mode (GQI-QPL) if
no modes are advertised.
Since there are devices in existence that support only QPL addressing,
and since we want to minimise code size, we choose to always use a
single fixed ring buffer even when using raw DMA addressing. Having
paid this penalty, we therefore choose to prefer QPL over RDA since
this allows the (virtual) hardware to minimise the number of page
table manipulations required. We similarly prefer GQI over DQO since
this minimises the amount of work we have to do: in particular, the RX
descriptor ring contents can remain untouched for the lifetime of the
device and refills require only a doorbell write.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for the "DQO" out-of-order transmit and receive queue
formats. These are almost entirely different in format and usage (and
even endianness) from the original "GQI" in-order transmit and receive
queues, and arguably should belong to a completely different device
with a different PCI ID. However, Google chose to essentially crowbar
two unrelated device models into the same virtual hardware, and so we
must handle both of these device models within the same driver.
Most of the new code exists solely to handle the differences in
descriptor sizes and formats. Out-of-order completions are handled
via a buffer ID ring (as with other devices supporting out-of-order
completions, such as the Xen, Hyper-V, and Amazon virtual NICs). A
slight twist is that on the transmit datapath (but not the receive
datapath) the Google NIC provides only one completion per packet
instead of one completion per descriptor, and so we must record the
list of chained buffer IDs in a separate array at the time of
transmission.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We cancel any pending transmissions when (re)starting the device since
any transmissions that were initiated before the admin queue reset
will not complete.
The network device core will also cancel any pending transmissions
after the device is closed. If the device is closed with some
transmissions still pending and is then reopened, this will therefore
result in a stale I/O buffer being passed to netdev_tx_complete_err()
when the device is restarted.
This error has not been observed in practice since transmissions
generally complete almost immediately and it is therefore unlikely
that the device will ever be closed with transmissions still pending.
With out-of-order queues, the device seems to delay transmit
completions (with no upper time limit) until a complete batch is
available to be written out as a block of 128 bytes. It is therefore
very likely that the device will be closed with transmissions still
pending.
Fix by ensuring that we have dropped all references to transmit I/O
buffers before returning from gve_close().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Handle async events related to link speed changes, link speed config
changes, and port PHY config changes.
Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>
The descriptors and completions in the DQO operating mode are not the
same sizes as the equivalent structures in the GQI operating mode.
Allow the queue stride size to vary by operating mode (and therefore
to be known only after reading the device descriptor and selecting the
operating mode).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Rename data structures and constants that are specific to the GQI
operating mode, to allow for a cleaner separation from other operating
modes.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently assume that the buffer index is equal to the descriptor
ring index, which is correct only for in-order queues.
Out-of-order queues will include a buffer tag value that is copied
from the descriptor to the completion. Redefine the data buffers as
being indexed by this tag value (rather than by the descriptor ring
index), and add a circular ring buffer to allow for tags to be reused
in whatever order they are released by the hardware.
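A minimal sketch of such a tag ring is shown below. The names and the ring size are illustrative assumptions rather than the driver's actual definitions.

```c
#include <stdint.h>

#define TAG_COUNT 4	/* Number of data buffers (illustrative) */

/* Circular ring of buffer tags available for reuse */
struct tag_ring {
	uint8_t ids[TAG_COUNT];	/* Tags, in the order they became free */
	unsigned int prod;	/* Number of tags handed out so far */
	unsigned int cons;	/* Number of tags released so far */
};

/* Initially, every tag is free and in ascending order */
void tag_ring_init ( struct tag_ring *ring ) {
	unsigned int i;

	for ( i = 0 ; i < TAG_COUNT ; i++ )
		ring->ids[i] = i;
	ring->prod = 0;
	ring->cons = 0;
}

/* Take the next free tag, or return -1 if all tags are in use */
int tag_alloc ( struct tag_ring *ring ) {
	if ( ( ring->prod - ring->cons ) >= TAG_COUNT )
		return -1;
	return ring->ids[ ring->prod++ % TAG_COUNT ];
}

/* Return a tag to the ring, in whatever order the hardware
 * happens to complete the associated buffers.
 */
void tag_release ( struct tag_ring *ring, unsigned int tag ) {
	ring->ids[ ring->cons++ % TAG_COUNT ] = tag;
}
```

If tags 0, 1, and 2 are allocated and the hardware then completes tag 2 followed by tag 0, subsequent allocations will return tags 3, 2, and 0 in that order.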
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Raw DMA addressing allows the transmit and receive descriptors to
provide the DMA address of the data buffer directly, without requiring
the use of a pre-registered queue page list. It is modelled in the
device as a magic "raw DMA" queue page list (with QPL ID 0xffffffff)
covering the whole of the DMA address space.
When using raw DMA addressing, the transmit and receive datapaths
could use the normal pattern of mapping I/O buffers directly, and
avoid copying packet data into and out of the fixed queue page list
ring buffer. However, since we must retain support for queue page
list addressing (which requires this additional copying), we choose to
minimise code size by continuing to use the fixed ring buffer even
when using raw DMA addressing.
Add support for using raw DMA addressing by setting the queue page
list base device address appropriately, omitting the commands to
register and unregister the queue page lists, and specifying the raw
DMA QPL ID when creating the TX and RX queues.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for the existence of a queue page list where the base device
address is non-zero, as will be the case for the raw DMA addressing
(RDA) operating mode.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The "create TX queue" and "create RX queue" commands have fields for
the descriptor and completion ring sizes, which are currently left
unpopulated since they are not required for the original GQI-QPL
operating mode.
Populate these fields, and allow for the possibility that a transmit
completion ring exists (which will be the case when using the DQO
operating mode).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The GVE family supports two incompatible descriptor queue formats:
* GQI: in-order descriptor queues
* DQO: out-of-order descriptor queues
and two addressing modes:
* QPL: pre-registered queue page list addressing
* RDA: raw DMA addressing
All four combinations (GQI-QPL, GQI-RDA, DQO-QPL, and DQO-RDA) are
theoretically supported by the Linux driver, which is essentially the
only public reference provided by Google. The original versions of
the GVE NIC supported only GQI-QPL mode, and so the iPXE driver is
written to target this mode, on the assumption that it would continue
to be supported by all models of the GVE NIC.
This assumption turns out to be incorrect: Google does not deem it
necessary to retain backwards compatibility. Some newer machine types
(such as a4-highgpu-8g) support only the DQO-RDA operating mode.
Add a definition of operating mode, and pass this as an explicit
parameter to the "configure device resources" admin queue command. We
choose a representation that subtracts one from the value passed in
this command, since this happens to allow us to decompose the mode
into two independent bits (one representing the use of DQO descriptor
format, one representing the use of QPL addressing).
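As an illustration of this decomposition, assuming that the command uses the same queue format values as the Linux driver (1 for GQI-RDA, 2 for GQI-QPL, 3 for DQO-RDA, and 4 for DQO-QPL), a sketch might look like the following. The macro and function names are hypothetical.

```c
/* Operating mode bits (names are illustrative) */
#define MODE_QPL 0x01	/* Use queue page list addressing */
#define MODE_DQO 0x02	/* Use out-of-order (DQO) queue format */

/* Convert operating mode to the value passed in the "configure
 * device resources" command (assumed to be 1-4 as described above).
 */
unsigned int mode_to_format ( unsigned int mode ) {
	return ( mode + 1 );
}

/* Convert a command queue format value back to an operating mode */
unsigned int format_to_mode ( unsigned int format ) {
	return ( format - 1 );
}
```

With this representation, code that cares only about the addressing mode or only about the descriptor format can test a single bit instead of comparing against each of the four combinations.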
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Linux driver occasionally uses the terminology "packet descriptor"
to refer to the portion of the descriptor excluding the buffer
address. This is not a helpful separation, and merely adds
complexity.
Simplify the code by removing this artificial separation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide space for the device to return its list of supported options.
Parse the option list and record the existence of each option in a
support bitmask.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Advertise support for adapter error recovery to the firmware.
Implement error recovery operations if an adapter fault is
detected. Refactor memory allocation to better align with probe and
open functions.
Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>
Some systems (observed with a Lenovo X1) fail to populate the loaded
image device path with a Uri() component when performing a UEFI HTTP
boot, instead creating a broken loaded image device path that
represents a DHCP+TFTP boot that has not actually taken place.
If no URI is found within the loaded image device path, then fall back
to looking for a URI within the current boot option.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
An EFI boot option (stored in a BootXXXX variable) comprises an
EFI_LOAD_OPTION structure, which includes some undefined number of EFI
device paths. (The structure is extremely messy and awkward to parse
in C, but that's par for the course with EFI.)
Add a function to extract the first device path from an EFI load
option, along with wrapper functions to read and extract the first
device path from an EFI boot variable.
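The parsing involved can be sketched as below, operating directly on the raw bytes of a load option. This is an illustrative approximation rather than the function added here: the function name and return convention are assumptions, while the field layout (attributes, file path list length, NUL-terminated description, device path list) follows the UEFI specification.

```c
#include <stddef.h>
#include <stdint.h>

#define DP_TYPE_END	0x7f	/* End-of-path node type */
#define DP_SUBTYPE_END	0xff	/* "End entire device path" subtype */

/* Locate the first device path within a raw EFI_LOAD_OPTION.
 * Returns 0 and fills in offset and length on success, or -1 if
 * the load option is malformed.
 */
int load_option_first_path ( const uint8_t *opt, size_t len,
			     size_t *offset, size_t *path_len ) {
	size_t pos = 6;	/* Skip Attributes and FilePathListLength */
	size_t list_len, end, node_len;
	int is_end;

	if ( len < pos )
		return -1;
	list_len = ( opt[4] | ( ( ( size_t ) opt[5] ) << 8 ) );

	/* Skip the NUL-terminated UCS-2 description string */
	for ( ; ; pos += 2 ) {
		if ( ( len - pos ) < 2 )
			return -1;
		if ( ( opt[pos] == 0 ) && ( opt[ pos + 1 ] == 0 ) )
			break;
	}
	pos += 2;

	/* The device path list must fit within the remaining data */
	if ( list_len > ( len - pos ) )
		return -1;
	end = ( pos + list_len );
	*offset = pos;

	/* Walk nodes until the end of the first device path */
	do {
		if ( ( end - pos ) < 4 )
			return -1;
		node_len = ( opt[ pos + 2 ] |
			     ( ( ( size_t ) opt[ pos + 3 ] ) << 8 ) );
		if ( ( node_len < 4 ) || ( node_len > ( end - pos ) ) )
			return -1;
		is_end = ( ( opt[pos] == DP_TYPE_END ) &&
			   ( opt[ pos + 1 ] == DP_SUBTYPE_END ) );
		pos += node_len;
	} while ( ! is_end );

	*path_len = ( pos - *offset );
	return 0;
}
```

Every length field is checked against the remaining data before use, since the contents of a boot variable cannot be assumed to be well formed.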
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The chainloaded-device-only "snponly" driver already drags in support
for driving SNP, NII, and MNP devices, on the basis that the user
generally doesn't care which UEFI API is used and just wants to boot
from the same network device that was used to load iPXE.
The multi-device "snp" driver already drags in support for driving SNP
and NII devices, but does not drag in support for MNP devices.
There is essentially zero code size overhead to dragging in support
for MNP devices, since this support is always present in any iPXE
application build anyway (as part of the code to download
"autoexec.ipxe" prior to installing our own drivers).
Minimise surprise by dragging in support for MNP devices whenever
using the "snp" driver, following the same reasoning used for the
"snponly" driver.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
DesignWare GPIO port numbers are represented as unsized single-entry
regions. Use fdt_reg() to obtain the GPIO port number, rather than
requiring access to a region cell size specification stored in the
port group structure.
This allows the field name "regs" in the port group structure to be
repurposed to hold the I/O register base address, which then matches
the common usage in other drivers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Many region types (e.g. I2C bus addresses) can only ever contain a
single region with no size cells specified. Provide fdt_reg() to
reduce boilerplate in this common use case.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commands were originally ordered by functional group (e.g. keeping the
image management commands together), with arrays used to impose a
functionally meaningful order within the group.
As the number of commands and functional groups has expanded over the
years, this has become essentially useless as an organising principle.
Switch to sorting commands alphabetically (using the linker table
mechanism).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The "md5sum" and "sha1sum" commands were originally intended solely as
debugging utilities, and would return success (with a warning message)
even if the specified images did not exist.
To minimise surprise and to be consistent with other commands, treat
the inability to acquire an image as a fatal error.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow the result of a digest calculation to be stored in a named
setting. This allows for digest verification in scripts using e.g.:
  set expected:hexraw cb05def203386f2b33685d177d9f04e3e3d70dd4
  sha1sum --set actual 1mb
  iseq ${expected} ${actual} || goto checksum_bad
Note that digest verification alone cannot be used to set the trusted
execution status of an image. The only way to mark an image as
trusted is to use the "imgverify" command.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add Christian Nilsson <nikize@gmail.com> as a project sponsorship
recipient, to reflect the enormous amount of time invested in
responding to issues and pull requests.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add "sha256sum", "sha512sum", and similar commands. Include these new
commands only when DIGEST_CMD is enabled in config/general.h and the
corresponding algorithm is enabled in config/crypto.h.
Leave "md5sum" and "sha1sum" included whenever only DIGEST_CMD is
enabled, to avoid potentially breaking backwards compatibility with
builds that disabled MD5 or SHA-1 as a TLS or X.509 digest algorithm,
but would still have expected those commands to be present.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Consumption of phandles will be in the form of locating a functional
device (e.g. a GPIO device, or an I2C device, or a reset controller)
by phandle, rather than locating the device tree node to which the
phandle refers.
Repurpose fdt_phandle() to obtain the phandle value (instead of
searching by phandle), and record this value as the bus location
within the generic device structure.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Read and display the core version immediately after mapping the MMIO
registers, to provide a basic sanity check that the registers have
been correctly mapped and the core is not held in reset.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove unnecessary driver-specific macros. Use the standard
pci_read_config_xxx(), pci_write_config_xxx(), and writel()/writeq()
calls.
Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>
When debugging is enabled for the device tree or memory map parsing
code, the active serial console UART variable will be accessed during
early initialisation, before the .bss section has been zeroed.
Place this variable in the .data section (by providing an explicit
initialiser), so that reading this variable is well defined even
during early initialisation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Variables in the .bss section cannot be relied upon to have zero
values during early initialisation, before we have relocated ourselves
to somewhere suitable in RAM and zeroed the .bss section.
Place any explicitly zero-initialised variables in the .data section
rather than in .bss, so that we can rely on their values even during
this early initialisation stage.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On startup, we may be running from read-only memory, and therefore
cannot zero the .bss section (or write to the .data section) until we
have parsed the system memory map and relocated ourselves to somewhere
suitable in RAM. The code that runs during this early initialisation
stage must be carefully written to avoid writing to the .data section
and to avoid reading from or writing to the .bss section.
Detecting code that erroneously writes to the .data or .bss sections
is relatively easy since running from read-only memory (e.g. via
QEMU's -pflash option) will immediately reveal the bug. Detecting
code that erroneously reads from the .bss section is harder, since in
a freshly powered-on machine (or in a virtual machine) there is a high
probability that the contents of the memory will be zero even before
we explicitly zero out the section.
Add the ability to fill the .bss section with an invalid non-zero
value to expose bugs in early initialisation code that erroneously
relies upon variables in .bss before the section has been zeroed. We
use the value 0xeb55eb55eb55eb55 ("EBSS") since this is immediately
recognisable as a value in a crash dump, and will trigger a page fault
if dereferenced since the address is in a non-canonical form.
Poisoning the .bss can be done only when the image is known to already
reside in writable memory. It will overwrite the relocation records,
and so can be done only on a system where relocation is known to be
unnecessary (e.g. because paging is supported). We therefore do not
enable this behaviour by default, but leave it as a configurable
option via the config/fault.h header.
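The properties of the poison value can be illustrated with a short sketch. The helper names are hypothetical, and the canonical address check assumes x86-64 with 48-bit virtual addresses.

```c
#include <stddef.h>
#include <stdint.h>

/* Poison value: reads as "EBSS" in a hex dump */
#define BSS_POISON 0xeb55eb55eb55eb55ULL

/* Fill a region (standing in for .bss) with the poison value */
void poison_fill ( uint64_t *start, size_t count ) {
	size_t i;

	for ( i = 0 ; i < count ; i++ )
		start[i] = BSS_POISON;
}

/* Check whether an address is canonical, i.e. whether bits 63:47
 * are either all zeros or all ones (assuming 48-bit addressing).
 */
int is_canonical48 ( uint64_t addr ) {
	uint64_t top = ( addr >> 47 );

	return ( ( top == 0 ) || ( top == 0x1ffff ) );
}
```

Since the poison value is not a canonical address, any early initialisation code that reads a pointer from the poisoned region and dereferences it will fault immediately instead of silently reading from an arbitrary location.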
Signed-off-by: Michael Brown <mcb30@ipxe.org>
PCI Express devices do not have physical INTx output signals, and on
modern motherboards there is unlikely to be any interrupt controller
with physical interrupt input signals. There are multiple levels of
abstraction involved in emulating the legacy INTx interrupt mechanism:
the PCIe device sends Assert_INTx and Deassert_INTx messages, PCIe
bridges and switches must collate these virtual wires, and the root
complex must map the virtual wires into messages that can be
understood by the host's emulated 8259 PIC.
This complex chain of emulations is rarely tested on modern hardware,
since operating systems will invariably use MSI-X for PCI devices and
the I/O APIC for non-PCI devices such as the real-time clock. Since
the legacy interrupt emulation mechanism is rarely tested, it is
frequently unreliable. We have encountered many issues over the years
in which legacy interrupts are simply not raised as expected, even
when inspection shows that the device believes it is asserting an
interrupt and the controller believes that the interrupt is enabled.
We already maintain a list of devices that are known to fail to
generate legacy interrupts correctly. This list is based on the PCI
vendor and device IDs, which is not necessarily a fair test since the
root cause may be a board-level misconfiguration rather than a
device-level fault.
Assume that any PCI Express device has a high chance of not being able
to raise legacy interrupts reliably. This is a relatively intrusive
change since it will affect essentially all modern network devices,
but should hopefully fix all future issues with non-functional legacy
interrupts, without needing to constantly grow the list of known
broken devices.
If some PCI Express devices are found to fail when operated in polling
mode, then this change will need to be revisited.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In the case of a misbehaving PXE stack, it is often useful to know the
PCI vendor and device IDs (e.g. for adding the device to the list of
devices with known broken support for generating interrupts).
The PCI vendor and device IDs are already available to the prefix code,
and so can trivially be printed out. Add this information to the PXE
prefix startup banner.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a basic driver for the DesignWare USB3 host controller as found in
the Lichee Pi 4A.
This driver covers only the DesignWare host controller hardware. On
the Lichee Pi 4A, this is sufficient to get the single USB root hub
port (exposed internally via the SODIMM connector) up and running.
The driver does not yet handle the various GPIOs that control power
and signal routing for the Lichee Pi 4A's onboard VL817 USB hub and
the four physical USB-A ports. This therefore leaves the USB hub and
the USB-A ports unpowered, and the USB2 root hub port routed to the
physical USB-C port. Devices plugged in to the USB-A ports will not
be powered up, and a device plugged in to the USB-C port will
enumerate as a USB2 device.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently use the downstream hub's port number to determine the
xHCI slot type for a newly connected USB device. The downstream hub
port number is irrelevant to the xHCI controller's supported protocols
table: the relevant value is the number of the root hub port through
which the device is attached.
Fix by using the root hub port number instead of the immediate parent
hub's port number.
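
As an illustration only, the lookup of the root hub port can be
sketched as a walk up the hub topology. The structures and function
name below are hypothetical simplifications and are not the real iPXE
USB data structures:

```c
#include <stddef.h>

/* Hypothetical, simplified topology types (not iPXE's real structures) */
struct example_hub;

struct example_port {
	struct example_hub *hub;	/* Hub to which this port belongs */
	unsigned int address;		/* Port number on that hub */
};

struct example_hub {
	/* Port into which this hub is plugged, or NULL for the root hub */
	struct example_port *upstream;
};

/* Walk upwards until we reach a port that belongs to the root hub */
static unsigned int example_root_hub_port ( const struct example_port *port ) {
	while ( port->hub->upstream != NULL )
		port = port->hub->upstream;
	return port->address;
}
```

For a device attached directly to the root hub, the walk terminates
immediately and returns the device's own port number.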
This bug has not previously been detected since the slot type for the
first N root hub ports will invariably be zero to indicate that these
are USB ports. For any xHCI controller with a sufficiently large
number of root hub ports, the code would therefore happen to calculate
the correct slot type value despite using an incorrect port number.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The WaitForKeyEx event in EFI_SIMPLE_TEXT_INPUT_EX_PROTOCOL is
redundant: by definition it has to signal under exactly the same
conditions as the WaitForKey event in EFI_SIMPLE_TEXT_INPUT_PROTOCOL
and cannot provide any "extended" information since EFI events do not
convey any information beyond their own occurrence.
UEFI keyboard drivers such as Ps2KeyboardDxe and UsbKbDxe invariably
use a single notification function to implement both events. The
console multiplexer driver ConSplitterDxe uses a single notification
function for both events, which ends up checking only the WaitForKey
event on the underlying console devices. (Since all console input is
routed through the console multiplexer, this means that in practice
nothing will ever check the underlying devices' WaitForKeyEx events.)
UEFI console consumers such as the UEFI shell tend to use only the
EFI_SIMPLE_TEXT_INPUT_PROTOCOL instance provided as ConIn in the EFI
system table. With the exception of the UEFI text editor (the "edit"
command in the UEFI shell), almost nothing bothers to open the
EFI_SIMPLE_TEXT_INPUT_EX_PROTOCOL instance on the same handle.
The Lenovo ThinkPad T14s Gen 5 has a very peculiar firmware bug.
Enabling the "UEFI Wi-Fi Network Boot" feature in the BIOS setup will
cause the completely unrelated WaitForKeyEx event pointer to be
overwritten with a pointer to a FAT_DIRENT structure representing the
"BOOT" directory in the EFI system partition. This happens with 100%
repeatability. It is not necessary to attempt to boot from Wi-Fi: it
is only necessary to have the feature enabled. The root cause is
unknown, but is presumably an uninitialised pointer or similar
memory-related bug in Lenovo's UEFI Wi-Fi driver.
Work around this Lenovo firmware bug by checking only the WaitForKey
event, ignoring the WaitForKeyEx event even if we will subsequently
use ReadKeyStrokeEx() to read the keypress. Since almost all other
UEFI console consumers use only WaitForKey, this ensures that we will
be using code paths that the firmware vendor is likely to have tested
at least once.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As with EFI_HANDLE, the EFI headers define EFI_EVENT as a void
pointer, rendering EFI_EVENT compatible with a pointer to itself and
hence guaranteeing that pointer type bugs will be introduced.
Redefine EFI_EVENT as a pointer to an anonymous structure (as we
already do for EFI_HANDLE) to allow the compiler to perform type
checking as expected.
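
A minimal sketch of the effect, using hypothetical stand-in names
rather than the real EFI header definitions. The dummy structure
member is present only to keep the sketch strictly portable:

```c
/* Hypothetical stand-in for the redefined event type */
typedef struct { char unused; } *EXAMPLE_EVENT;

/* Function expecting a single event */
static int example_check_event ( EXAMPLE_EVENT event ) {
	return ( event != 0 );
}

static int example_demo ( EXAMPLE_EVENT *events ) {
	/* Correct: pass one event taken from the array.
	 *
	 * Passing "events" itself would be accepted silently if
	 * EXAMPLE_EVENT were defined as "void *", but is diagnosed
	 * as an incompatible pointer type with the definition above.
	 */
	return example_check_event ( events[0] );
}
```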
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The UEFI model for wireless network boot cannot sensibly be described
without cursing. Commit 758a504 ("[efi] Inhibit calls to Shutdown()
for wireless SNP devices") attempts to work around some of the known
issues.
Experimentation shows that on at least some platforms (observed with a
Lenovo ThinkPad T14s Gen 5) the vendor SNP driver is broken to the
point of being unusable in anything other than the single use case
envisioned by the firmware authors. Doing almost anything directly
via the SNP protocol interface has a greater than 50% chance of
locking up the system.
Assume, in the absence of any evidence to the contrary so far, that
vendor SNP drivers for wireless network devices are so badly written
as to be unusable. Refuse to even attempt to interact with these
drivers via the SNP or NII protocol interfaces.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is nothing in the current versions of the UEFI specification
that limits the TPL at which we may call ConnectController() or
DisconnectController(). However, at least some platforms (observed
with a Lenovo ThinkPad T14s Gen 5) will occasionally and unpredictably
lock up before returning from ConnectController() if called at a TPL
higher than TPL_APPLICATION.
Work around whatever defect is present on these systems by dropping to
the current external TPL for all calls to ConnectController() or
DisconnectController().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Set the appropriate Svpbmt type bits within page table entries if the
extension is supported. Tested only in QEMU so far, due to the lack
of availability of real hardware supporting Svpbmt.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Reuse the code that creates I/O device page mappings to create the
coherent DMA mapping of the 32-bit address space on demand, instead of
constructing this mapping as part of the initial page table.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
All 64-bit paging schemes support at least 1GB "gigapages". Use these
to map I/O devices instead of 2MB "megapages". This reduces the
number of consumed page table entries, increases the visual similarity
of I/O remapped addresses to the underlying physical addresses, and
opens up the possibility of reusing the code to create the coherent
DMA map of the 32-bit address space.
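
The reduction in page table entries can be illustrated with a small
calculation. This is a sketch only: the macro and function names are
hypothetical, and the region length is assumed to be non-zero:

```c
#include <stdint.h>

#define EXAMPLE_MEGAPAGE ( UINT64_C ( 1 ) << 21 )	/* 2MB */
#define EXAMPLE_GIGAPAGE ( UINT64_C ( 1 ) << 30 )	/* 1GB */

/* Number of page table entries needed to cover [start, start+len)
 * using pages of the given (power of two) size.  Assumes len > 0.
 */
static uint64_t example_pte_count ( uint64_t start, uint64_t len,
				    uint64_t page_size ) {
	uint64_t first = ( start & ~( page_size - 1 ) );
	uint64_t last = ( ( start + len - 1 ) & ~( page_size - 1 ) );

	return ( ( ( last - first ) / page_size ) + 1 );
}
```

For example, a 16MB device region starting at 0x40000000 would need
eight megapage entries but only a single gigapage entry.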
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The data cache must be invalidated twice for RX DMA buffers: once
before passing ownership to the DMA device (in case the cache happens
to contain dirty data that will be written back at an undefined future
point), and once after receiving ownership from the DMA device (in
case the CPU happens to have speculatively accessed data in the buffer
while it was owned by the hardware).
Only the used portion of the buffer needs to be invalidated after
completion, since we do not care about data within the unused portion.
Update the DMA API to include the used length as an additional
parameter to dma_unmap(), and add the necessary second cache
invalidation pass to the RISC-V DMA API implementation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a RISC-V assembly language implementation of TCP/IP checksumming,
which is around 50x faster than the generic algorithm. The main loop
checksums aligned xlen-bit words, using almost entirely compressible
instructions and accumulating carries in a separate register to allow
folding to be deferred until after all loops have completed.
Experimentation on a C910 CPU suggests that this achieves around four
bytes per clock cycle, which is comparable to the x86 implementation.
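
The deferred-carry technique can be sketched in portable C as follows.
This is not the assembly implementation itself: it uses 32-bit words
assembled in network byte order rather than aligned xlen-bit loads,
and the function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the complemented 16-bit ones' complement sum of the data */
static uint16_t example_tcpip_chksum ( const uint8_t *data, size_t len ) {
	uint32_t sum = 0;
	uint32_t carries = 0;
	uint32_t word;
	size_t i;

	/* Main loop: add one word at a time, counting carries separately */
	while ( len >= 4 ) {
		word = ( ( ( uint32_t ) data[0] << 24 ) |
			 ( ( uint32_t ) data[1] << 16 ) |
			 ( ( uint32_t ) data[2] << 8 ) | data[3] );
		sum += word;
		carries += ( sum < word );
		data += 4;
		len -= 4;
	}

	/* Trailing bytes, zero-padded to a full word */
	if ( len ) {
		word = 0;
		for ( i = 0 ; i < len ; i++ )
			word |= ( ( uint32_t ) data[i] << ( 24 - ( 8 * i ) ) );
		sum += word;
		carries += ( sum < word );
	}

	/* Deferred folding: add back the carries, then fold to 16 bits */
	sum += carries;
	if ( sum < carries )
		sum++;
	sum = ( ( sum & 0xffff ) + ( sum >> 16 ) );
	sum = ( ( sum & 0xffff ) + ( sum >> 16 ) );

	return ( ( uint16_t ) ~sum );
}
```

Counting carries in a separate variable keeps the main loop free of
any folding logic, which is the property that the assembly
implementation exploits.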
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide an implementation of dma_map() that performs cache clean or
invalidation as required, and an implementation of dma_alloc() that
returns virtual addresses within the coherent mapping of the 32-bit
physical address space.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Cache management operations must generally be performed on virtual
addresses rather than physical addresses.
Change the address parameter in dma_map() to be a virtual address, and
make dma() the API-level primitive instead of dma_phys().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Generating an isohybrid image with `xorrisofs` is supposed to happen
with option `-isohybrid-gpt-basdat`, not command `isohybrid`.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On platforms where DMA devices are not in the same coherency domain as
the CPU cache, it is necessary to be able to explicitly clean the
cache (i.e. force data to be written back to main memory) and
invalidate the cache (i.e. discard any cached data and force a
subsequent read from main memory).
Add support for cache management via the standard Zicbom extension or
the T-Head cache management operations extension, with the supported
extension detected on first use.
Support cache management operations only on I/O buffers, since these
are guaranteed to not share cachelines with other data.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On platforms where DMA devices are not in the same coherency domain as
the CPU cache, we must ensure that DMA I/O buffers do not share
cachelines with other data.
Align the start and end of I/O buffers to IOB_ZLEN, which is larger
than any cacheline size we expect to encounter.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On platforms where DMA devices are not in the same coherency domain as
the CPU cache, it is necessary to create page table entries where the
translations are marked as uncacheable.
We choose to place iPXE within the low 4GB of memory (since 32-bit DMA
devices are still reasonably common even on systems with 64-bit CPUs).
We therefore need to cover only the low 4GB of memory with these page
table entries.
Update virt_to_phys() to allow for the existence of such a mapping,
assuming that iPXE itself will always reside within the top 4GB of the
64-bit virtual address space (and therefore that the DMA mapping must
lie somewhere below this in the negative virtual address space).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use PTEs 256-259 to create a mapping of the 32-bit physical address
space with attributes suitable for coherent DMA mappings.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The page table entries for the identity map vary according to the
paging level in use, and so must be constructed within the loop used
to detect the maximum supported paging level. Other page table
entries are invariant between paging levels, and so may be constructed
just once before entering the loop.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove logic that programs the hardware to strip out VLAN from RX
packets. Do not drop packets due to VLAN mismatch and allow the upper
layer to decide whether to discard the packets.
Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>
iPXE is released under the GNU GPL and is 100% open source software.
There are no "premium editions", no in-app advertisements, and no
hidden costs. The fully public version published to GitHub is and
always will be the definitive and only version of iPXE.
Many large features in iPXE have been commercially funded within this
open source model, with features being published upstream as soon as
they are complete and made available for the whole world to use, not
restricted for use only by the customer funding that particular piece
of development work.
There has not to date been any funding model for smaller pieces of
work, such as occasional code review or guaranteed attention to bug
reports. The overhead of establishing a commercial relationship is
usually too high to be worthwhile for very small units of work.
The GitHub sponsorship mechanism provides a framework for efficiently
handling small commercial requests (or individual tokens of thanks).
Add a FUNDING.yml file to provide a convenient way for anyone who
wants to support the ongoing open source development of iPXE to do so.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We no longer have any requirement for common symbols. Disable common
symbols via the -fno-common compiler option, and simplify the test for
support of -fdata-sections (which can return a false negative when
common symbols are enabled).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some legacy drivers use large static allocations for transmit and
receive buffers. To avoid bloating the .bss segment, we currently
implement these as a single common symbol named "_shared_bss" (which
is permissible since only one legacy driver may be active at any one
time).
Switch to dynamic allocation of these .bss-like segments, to avoid the
requirement for using common symbols.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently have contexts in which the local variable "nic" is a
pointer to the global variable also called "nic". This complicates
the creation of macros.
Rename the global variable to "legacy_nic" to reduce pollution of the
global namespace and to allow for the creation of macros referring to
fields within this global variable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for probing a device based on the path or alias found in
the "/chosen/stdout-path" node, and using a consequently instantiated
UART as the default serial console.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The 16550 design includes a programmable 16-bit clock divider for an
arbitrary input clock, requiring knowledge of the input clock
frequency in order to calculate the divider value for a given baud
rate. The 16550 UARTs in an x86 PC will always have a 1.8432 MHz
input clock. Non-x86 systems may have other input clock frequencies.
Define the input clock frequency as a property of a 16550 UART, and
read the value from the device tree "clock-frequency" property.
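
For reference, the divider calculation can be sketched as below. The
16550 samples at sixteen times the baud rate, so the divider is the
input clock divided by sixteen times the baud rate. The names used
here are hypothetical:

```c
#include <stdint.h>

/* Input clock frequency of a standard x86 PC UART (1.8432 MHz) */
#define EXAMPLE_PC_UART_CLOCK 1843200

/* Calculate the 16-bit clock divider for a given baud rate */
static uint16_t example_uart_divisor ( uint32_t clock, uint32_t baud ) {
	return ( ( uint16_t ) ( clock / ( 16 * baud ) ) );
}
```

With the PC input clock, 115200 baud gives a divider of 1 and 9600
baud gives a divider of 12. A different input clock gives different
dividers for the same baud rates, which is why the frequency must be
known.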
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some implementations of 16550-compatible UARTs (e.g. the DesignWare
UART) are known to ignore writes to the line control register while
the transmitter is active.
Wait for the transmitter to become empty before attempting to write to
the line control register.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow the platform configuration to provide a mechanism for
identifying the serial console UART. Provide two globally available
mechanisms: "null" (i.e. no serial console), and "fixed" (i.e. use
whatever is specified by COMCONSOLE in config/serial.h).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When a native serial driver is enabled for the system console device
specified via "/chosen/stdout-path", it is very likely that this will
correspond to the same physical serial port used for the SBI debug
console.
Inhibit input and output via the SBI console whenever a serial console
is active, to avoid duplicated output characters and unpredictable
input behaviour.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iPXE drivers have been written with the implicit assumption that MMIO
writes are allowed to be posted but that an MMIO register read or
write after another MMIO register write will always observe the
effects of the first write.
For example: after having written a byte to the transmit holding
register (THR) of a 16550 UART, it is expected that any subsequent
read of the line status register (LSR) will observe a value consistent
with the occurrence of the write.
RISC-V does not seem to provide any ordering guarantees between
accesses to different registers within the same MMIO device. Add
fences as part of the MMIO accessors to provide the assumed
guarantees.
Use "fence io, io" before each MMIO read or write to enforce full
serialisation of MMIO accesses with respect to each other. This is
almost certainly more conservative than is strictly necessary.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use the generic UART driver-private data pointer, rather than
embedding the generic UART within the 16550 UART structure.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
16550 UARTs exist on non-x86 platforms but will be accessible via MMIO
rather than port I/O. It is possible to encounter MMIO-mapped 16550
UARTs on x86 platforms, but there is no real requirement to support
them in iPXE since the standard COM1, COM2, etc ports have been
present on every PC-compatible machine since 1981.
Assume for now that accessing 16550 UART registers requires
inb()/outb() on x86 and readb()/writeb() on other architectures.
Allow for the existence of a register shift on MMIO-mapped 16550
UARTs, since modern SoCs tend to treat register addresses as being
aligned to either 32-bit or 64-bit boundaries.
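
A sketch of how a register shift might be applied by an MMIO accessor.
The function name is hypothetical and the real accessors may be
structured differently:

```c
#include <stdint.h>

/* Read a 16550 register from an MMIO-mapped UART.
 *
 * A shift of 0 gives byte-spaced registers, 2 gives registers on
 * 32-bit boundaries, and 3 gives registers on 64-bit boundaries.
 */
static uint8_t example_uart_read ( volatile uint8_t *base,
				   unsigned int reg, unsigned int shift ) {
	return base[ reg << shift ];
}
```

With a shift of 2, the line status register (register 5) is read from
byte offset 0x14 rather than byte offset 5.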
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove the assumption that all platforms use a fixed number of 16550
UARTs identifiable by a simple numeric index. Create an abstraction
allowing for dynamic instantiation and registration of any number of
arbitrary UART models.
The common case of the serial console on x86 uses a single fixed UART
specified at compile time. Avoid unnecessarily dragging in the
dynamic instantiation code in this use case by allowing COMCONSOLE to
refer to a single static UART object representing the relevant port.
When selecting a UART by command-line argument (as used in the
"gdbstub serial <port>" command), allow the UART to be specified as
either a numeric index (to retain backwards compatibility) or a
case-insensitive port name such as "COM2".
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In the context of serial consoles, the use of any frame formats other
than the standard 8 data bits, no parity, and one stop bit is so rare
as to be nonexistent.
Remove the almost certainly unused support for custom frame formats.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The early UART is an optional feature used to obtain debug output from
the prefix before iPXE is able to parse the device tree.
Extend this feature to also cover any console output that iPXE
attempts to send to the SBI console, on the basis that the purpose of
the early UART is to provide an output-only device for situations in
which there is no functional SBI console.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The RISC-V "fence" instruction encoding includes bits for predecessor
and successor input and output operations, separate from read and
write operations. It is up to the CPU implementation to decide what
counts as I/O space rather than memory space for the purposes of this
instruction.
Since we do not expect fencing to be performance-critical, keep
everything as simple and reliable as possible by using the unadorned
"fence" instruction (equivalent to "fence iorw, iorw").
Add a memory clobber to ensure that the compiler does not reorder the
barrier. (The volatile qualifier seems to already prevent reordering
in practice, but this is not guaranteed according to the compiler
documentation.)
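
A minimal sketch of such a barrier, assuming GCC-style inline
assembly. The function names are hypothetical, and the non-RISC-V
branch exists only so that the sketch compiles elsewhere (where it
acts as a compiler barrier only):

```c
/* Full barrier: unadorned "fence" plus a compiler memory clobber */
static inline void example_fence ( void ) {
#if defined ( __riscv )
	__asm__ __volatile__ ( "fence" : : : "memory" );
#else
	/* Assumption: compiler-only barrier for other architectures */
	__asm__ __volatile__ ( "" : : : "memory" );
#endif
}

/* Example usage: order a store before a subsequent load */
static int example_publish ( volatile int *flag ) {
	*flag = 1;
	example_fence();
	return *flag;
}
```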
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Non-permitted name characters such as a colon are sometimes used to
separate alias names or paths from additional metadata, such as the
baud rate for a UART in the "/chosen/stdout-path" property.
Support the use of such alias names and paths by allowing any
character not permitted in a property name to terminate a property or
node name match. (This is a very relaxed matching rule that will
produce false positive matches on invalid input, but this is unlikely
to cause problems in practice.)
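
The relaxed matching rule can be sketched as follows. The function
names are hypothetical, and the set of permitted property name
characters is an assumption based on a reading of the Devicetree
Specification:

```c
#include <ctype.h>
#include <string.h>

/* Check if character is permitted in a property name (assumed set) */
static int example_is_name_char ( char c ) {
	if ( c == '\0' )
		return 0;
	return ( isalnum ( ( unsigned char ) c ) ||
		 ( strchr ( ",._+?#-", c ) != NULL ) );
}

/* Check if the requested string begins with the given name and is
 * then terminated by any character not permitted in a property name
 */
static int example_name_match ( const char *requested, const char *name ) {
	size_t len = strlen ( name );

	if ( strncmp ( requested, name, len ) != 0 )
		return 0;
	return ( ! example_is_name_char ( requested[len] ) );
}
```

Under this rule a request for "serial0:115200n8" matches the alias
"serial0", while a request for "serial01" does not.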
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Devices with only 32-bit DMA addressing are relatively common even on
systems with 64-bit CPUs. Limit relocation of iPXE to 32-bit address
space so that I/O buffers and other DMA allocations will be accessible
by 32-bit devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We will want to be able to create the console device as early as
possible. Refactor devicetree probing to remove the assumption that a
devicetree device must have a devicetree parent, and expose functions
to allow a standalone device to be created given only the offset of a
node within the tree.
The full device path is no longer trivial to construct with this
assumption removed. The full path is currently used only for debug
messages. Remove the stored full path, use just the node name for
debug messages, and ensure that the topology information previously
visible in the full path is reconstructible from the combined debug
output if needed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for RFC 3442 classless static routes provided via DHCP
option 121.
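
For reference, RFC 3442 encodes each route as a prefix length byte,
followed by only the significant octets of the destination, followed
by a four-byte router address. A sketch of a parser for one such
entry is shown below. The structure and function names are
hypothetical and do not reflect the actual implementation:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded route (addresses in network byte order) */
struct example_route {
	uint8_t dest[4];
	uint8_t prefix_len;
	uint8_t router[4];
};

/* Parse one route; returns bytes consumed, or 0 if malformed */
static size_t example_parse_route ( const uint8_t *data, size_t len,
				    struct example_route *route ) {
	size_t dest_len;

	/* Prefix length must be present and no greater than 32 */
	if ( ( len < 1 ) || ( data[0] > 32 ) )
		return 0;

	/* Only the significant destination octets are present */
	dest_len = ( ( data[0] + 7 ) / 8 );
	if ( len < ( 1 + dest_len + 4 ) )
		return 0;

	memset ( route, 0, sizeof ( *route ) );
	route->prefix_len = data[0];
	memcpy ( route->dest, &data[1], dest_len );
	memcpy ( route->router, &data[ 1 + dest_len ], 4 );
	return ( 1 + dest_len + 4 );
}
```

A caller would invoke this repeatedly, advancing by the returned
length, until the option data is exhausted.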
Originally-implemented-by: Hazel Smith <hazel.smith@leicester.ac.uk>
Originally-implemented-by: Raphael Pour <raphael.pour@hetzner.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Extend the definition of an IPv4 routing table entry to allow for the
expression of non-default gateways for specified off-link subnets, and
of on-link secondary subnets (where we can send directly to the
destination address even though our source address is not within the
subnet).
This more precise definition also allows us to correctly handle
routing in the (uncommon for iPXE) case when multiple network
interfaces are open concurrently and more than one interface has a
default gateway.
The common case of a single IPv4 address/netmask and a default gateway
now results in two routing table entries. To retain backwards
compatibility with existing documentation (and to avoid on-screen
clutter), the "route" command prints default gateways on the same line
as the locally assigned address. There is therefore no change in
output from the "route" command unless explicit additional (off-link
or on-link) routes are present.
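
One plausible way to use such a table is a longest-prefix lookup, as
sketched below. This is an illustration under stated assumptions
rather than a description of the actual lookup code: the structure,
the function name, and the selection rule are all hypothetical, and
addresses are held in host byte order for simplicity:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routing table entry (host byte order) */
struct example_route {
	uint32_t subnet;	/* Destination subnet */
	uint32_t netmask;	/* Subnet mask (contiguous) */
	uint32_t gateway;	/* Gateway, or zero if on-link */
};

/* Find next hop for a destination; returns zero if there is no route */
static uint32_t example_next_hop ( const struct example_route *routes,
				   size_t count, uint32_t dest ) {
	const struct example_route *best = NULL;
	size_t i;

	/* Choose the matching route with the longest prefix */
	for ( i = 0 ; i < count ; i++ ) {
		if ( ( dest & routes[i].netmask ) != routes[i].subnet )
			continue;
		if ( ( best == NULL ) ||
		     ( routes[i].netmask > best->netmask ) )
			best = &routes[i];
	}
	if ( ! best )
		return 0;

	/* On-link destinations are sent to directly */
	return ( best->gateway ? best->gateway : dest );
}
```

In the common case described above, the table would hold an on-link
entry for the local subnet and a zero-length prefix entry pointing at
the default gateway.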
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Xuantie/T-Head processors such as the C910 (as used in the Sipeed
Lichee Pi 4A) use the high bits of the PTE in a very non-standard way
that is incompatible with the RISC-V specification.
As per the "Memory Attribute Extension (XTheadMae)", bits 62 and 61
represent cacheability and "bufferability" (write-back cacheability)
respectively. If we do not enable these bits, then the processor gets
incredibly confused at the point that paging is enabled. The symptom
is that cache lines will occasionally fail to fill, and so reads from
any address may return unrelated data from a previously read cache
line for a different address.
Work around these hardware flaws by detecting T-Head CPUs (via the
"get machine vendor ID" SBI call), then reading the vendor-specific
SXSTATUS register to determine whether or not the vendor-specific
Memory Attribute Extension has been enabled by the M-mode firmware.
If it has, then set bits 61 and 62 in each page table entry that is
used to access normal memory.
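
The bit manipulation involved can be sketched as below. The macro
and function names are hypothetical, and detection of the extension
(via the SBI call and the SXSTATUS register) is represented only by a
flag passed in by the caller:

```c
#include <stdint.h>

/* XTheadMae page table entry attribute bits, as described above */
#define EXAMPLE_PTE_THEAD_CACHEABLE ( UINT64_C ( 1 ) << 62 )
#define EXAMPLE_PTE_THEAD_BUFFERABLE ( UINT64_C ( 1 ) << 61 )

/* Apply memory attributes to a PTE used to access normal memory */
static uint64_t example_memory_pte ( uint64_t pte, int mae_enabled ) {
	if ( mae_enabled ) {
		pte |= ( EXAMPLE_PTE_THEAD_CACHEABLE |
			 EXAMPLE_PTE_THEAD_BUFFERABLE );
	}
	return pte;
}
```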
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a fence between the write to the UART transmit register and the
subsequent read from the transmit status register, to ensure that the
status correctly reflects the occurrence of the write.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The RISC-V specification states that "if SATP is written with an
unsupported mode, the entire write has no effect; no fields in SATP
are modified". We currently rely on this specified behaviour when
calculating the early UART base address: if SATP has a non-zero value
then we assume that paging must be enabled.
The XuanTie C910 CPU (as used in the Lichee Pi 4A) does not conform to
this specified behaviour. Writing SATP with an unsupported mode will
leave SATP.MODE as zero (i.e. bare physical addressing) but the write
to SATP.PPN will still take effect, leaving SATP with an illegal
non-zero value.
Work around this misbehaviour by explicitly writing zero to SATP if we
detect that the mode change has not taken effect (e.g. because the CPU
does not support the requested paging mode).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently rely on the recursive nature of devicetree bus probing to
obtain the region cell size specification from the parent device.
This blocks the possibility of creating a standalone console device
based on /chosen/stdout-path before probing the whole bus.
Fix by using fdt_parent() to locate the parent device at the point of
use within dt_ioremap().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some platforms (such as the Sipeed Lichee Pi 4A) choose to make early
debugging entertainingly cumbersome for the programmer. These
platforms not only fail to provide a functional SBI debug console, but
also choose to place the UART at a physical address that cannot be
identity-mapped under the only paging model supported by the CPU.
Support such platforms by creating a virtual address mapping for the
early UART (in the 2MB megapage immediately below iPXE itself), and
using this as the UART base address whenever paging is enabled.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some platforms (such as the Sipeed Lichee Pi 4A) do not provide a
functional SBI debug console. We can obtain early debug messages on
these systems by writing directly to the UART used by the vendor
firmware.
There is no viable way to parse the UART address from the device tree,
since the prefix debug messages occur extremely early, before the C
runtime environment is available and therefore before any information
has been parsed from the device tree. The early UART model and
register addresses must be configured by editing config/serial.h if
needed. (This is an acceptable limitation, since prefix debugging is
an extremely specialised use case.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Abstract out the SBI debug console calls into macros that can be
shared between print_message and print_hex_value.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The riscv,isa devicetree property appears not to be fully populated on
some real-world systems. For example, the Sipeed Lichee Pi 4A
(running the vendor U-Boot) reports itself as "rv64imafdcvsu", which
does not include the "zicntr" extension even though the time CSR is
present and functional.
Ignore the riscv,isa property and rely solely on CSR testing to
determine whether or not extensions are present.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
With the 64-bit paging schemes (Sv39, Sv48, and Sv57), we identity-map
as much of the physical address space as is possible. Experimentation
shows that this is not sufficient to provide access to all I/O
devices. For example: the Sipeed Lichee Pi 4A includes a CPU that
supports only Sv39, but places I/O devices at the top of a 40-bit
address space.
Add support for creating I/O page table entries on demand to map I/O
devices, based on the existing design used for x86_64 BIOS.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Support debug consoles that do not automatically convert LF to CRLF by
including the CR character within the debug message strings.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide DBGC_MEMMAP() as a replacement for memmap_dump(), allowing the
colour used to match that of other messages within the same message
group.
Retain a dedicated colour for output from memmap_dump_all(), on the
basis that it is generally most useful to visually compare full memory
dumps against previous full memory dumps.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use the terminology "min" and "max" for addresses covered by a memory
region descriptor, since this is sufficiently intuitive to generally
not require further explanation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use the shared initrd reshuffling and CPIO header construction code
for RISC-V bare-metal kernels. This allows for files to be injected
into the constructed ("magic") initrd image in exactly the same way as
is done for bzImage and UEFI kernels.
We append a dummy image encompassing the FDT to the end of the
reshuffle list, so that it ends up directly following the constructed
initrd in memory (but excluded from the initrd length, which was
recorded before constructing the FDT).
We also temporarily prepend the kernel binary itself to the reshuffle
list. This is guaranteed to be safe (since reshuffling is designed to
be unable to fail), and avoids the requirement for the kernel segment
to be available before reshuffling. This is useful since current
RISC-V bare-metal kernels tend to be distributed as EFI zboot images,
which require large temporary allocations from the external heap for
the intermediate images created during archive extraction.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Any initrd images that are not within the external heap (e.g. embedded
images) do not need to be copied to the external heap for reshuffling,
and can just be left in their original locations.
Ignore any images that are not already within the external heap (or,
more precisely, that are wholly outside of the reshuffle region within
the external heap) when squashing and swapping images.
This reduces the maximum additional storage required by squashing and
swapping to zero, and so ensures that the reshuffling step is
guaranteed to succeed under all circumstances. (This is unrelated to
the post-reshuffle load region check, which is still required.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide a reusable function initrd_load_all() to load all initrds
(including any constructed CPIO headers) into a contiguous memory
region, and support functions to find the constructed total length and
permissible post-reshuffling load address range.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
It is hypothetically possible for external heap memory allocated
during driver startup to have been freed before an image was
downloaded, which could therefore leave an image straddling the
address recorded as the top of the reshuffle region.
Allow for this possibility by skipping squashing for any images
already straddling (or touching) the top of the reshuffle region.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Alignment of initrd lengths is applicable to all Linux kernels, not
just those in the x86 bzImage format.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Eliminate the requirement for free space when reshuffling initrds by
swapping adjacent initrds using an in-place triple reversal.
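
The triple reversal technique can be sketched on plain byte buffers
as follows. The function names are hypothetical, and the real code
operates on images rather than on a single flat buffer:

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse a block of bytes in place */
static void example_reverse ( uint8_t *data, size_t len ) {
	uint8_t tmp;
	size_t i;

	for ( i = 0 ; i < ( len / 2 ) ; i++ ) {
		tmp = data[i];
		data[i] = data[ len - 1 - i ];
		data[ len - 1 - i ] = tmp;
	}
}

/* Swap two adjacent blocks in place, using no additional storage:
 * reverse each block individually, then reverse the combined region
 */
static void example_swap_adjacent ( uint8_t *data, size_t first_len,
				    size_t second_len ) {
	example_reverse ( data, first_len );
	example_reverse ( ( data + first_len ), second_len );
	example_reverse ( data, ( first_len + second_len ) );
}
```

For example, swapping the adjacent blocks "ABCDE" and "xy" produces
"EDCBAyx" after the two individual reversals and "xyABCDE" after the
final reversal of the whole region.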
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently rely on implicit detection of the external heap region.
The INT 15 memory map mangler relies on examining the corresponding
in-use memory region, and the initrd reshuffler relies on performing a
separate detection of the largest free memory block after startup has
completed.
Replace these with explicit public symbols to describe the external
heap region.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If the external heap ends up at the top of the system memory map then
leave a gap after the heap to ensure that no block ends up being
allocated with either a start or end address of zero, since this is
frequently confusing to both code and humans.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for relocation to a region at the very end of the physical
address space (where the next address wraps to zero).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use the word-at-a-time variable-length memcpy() implementation when
performing an overlapping copy in the forwards direction, since this
is guaranteed to be safe and likely to be substantially faster than
the existing bytewise copy.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow a single initrd image to be passed verbatim to the booted RISC-V
kernel, as a proof of concept.
We do not yet support reshuffling to make optimal use of available
memory, or dynamic construction of CPIO headers, but this is
sufficient to allow iPXE to start up the Fedora 42 kernel with its
matching initrd image.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow an initrd location to be specified in our constructed device
tree via the "linux,initrd-start" and "linux,initrd-end" properties.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is nothing x86-specific in initrd.c, and a variant of the
reshuffling logic will be required for executing bare-metal kernels on
RISC-V and AArch64.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use image_replace() to transfer execution to the extracted image,
rather than calling image_exec() directly. This allows the original
archive image to be freed immediately if it was marked as an
automatically freeable image (e.g. via "chain --autofree").
In particular, this ensures that in the case of an archive image
containing another archive image (such as an EFI zboot kernel wrapper
image containing a gzip-compressed kernel image), the intermediate
extracted image will be freed as early as possible, since extracted
images are always marked as automatically freeable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Current RISC-V and AArch64 kernels found in the wild tend not to be in
the documented kernel format, but are instead "EFI zboot" kernels
comprising a small EFI executable that decompresses and executes the
inner payload (which is a kernel in the expected format).
The EFI zboot header includes a recognisable magic value "zimg" along
with two fields describing the offset and length of the compressed
payload. We can therefore treat this as an archive image format,
extracting the payload as-is and then relying on our existing ability
to execute compressed images.
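
A sketch of how the payload might be located is shown below. The
field offsets used here (magic at byte 4, followed by little-endian
32-bit payload offset and length fields) are assumptions based on the
Linux zboot header layout, and the names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMED header layout (not confirmed by the text above) */
#define EXAMPLE_ZBOOT_MAGIC_OFFSET 4
#define EXAMPLE_ZBOOT_PAYLOAD_OFFSET 8
#define EXAMPLE_ZBOOT_PAYLOAD_LEN 12
#define EXAMPLE_ZBOOT_HEADER_LEN 16

/* Read a little-endian 32-bit value */
static uint32_t example_le32 ( const uint8_t *p ) {
	return ( ( ( uint32_t ) p[0] ) | ( ( uint32_t ) p[1] << 8 ) |
		 ( ( uint32_t ) p[2] << 16 ) | ( ( uint32_t ) p[3] << 24 ) );
}

/* Locate the compressed payload; returns 0 on success, -1 on failure */
static int example_zboot_payload ( const uint8_t *image, size_t image_len,
				   size_t *offset, size_t *len ) {
	size_t payload_offset;
	size_t payload_len;

	/* Check for presence of header and magic value */
	if ( image_len < EXAMPLE_ZBOOT_HEADER_LEN )
		return -1;
	if ( memcmp ( ( image + EXAMPLE_ZBOOT_MAGIC_OFFSET ), "zimg", 4 ) != 0 )
		return -1;

	/* Check that the payload lies wholly within the image */
	payload_offset = example_le32 ( image + EXAMPLE_ZBOOT_PAYLOAD_OFFSET );
	payload_len = example_le32 ( image + EXAMPLE_ZBOOT_PAYLOAD_LEN );
	if ( ( payload_offset > image_len ) ||
	     ( payload_len > ( image_len - payload_offset ) ) )
		return -1;

	*offset = payload_offset;
	*len = payload_len;
	return 0;
}
```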
This is sufficient to allow iPXE to execute the Fedora 42 RISC-V
kernel binary as currently published.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The RISC-V and AArch64 bare-metal kernel images share a common header
format, and require essentially the same execution environment: loaded
close to the start of RAM, entered with paging disabled, and passed a
pointer to a flattened device tree that describes the hardware and any
boot arguments.
Implement basic support for executing bare-metal RISC-V and AArch64
kernel images. The (trivial) AArch64-specific code path is untested
since we do not yet have the ability to build for any bare-metal
AArch64 platforms. Constructing and passing an initramfs image is not
yet supported.
Rename the IMAGE_BZIMAGE build configuration option to IMAGE_LKRN,
since "bzImage" is specific to x86. To retain backwards compatibility
with existing local build configurations, we leave IMAGE_BZIMAGE as
the enabled option in config/default/pcbios.h and treat IMAGE_LKRN as
a synonym for IMAGE_BZIMAGE when building for x86 BIOS.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add an implementation of umalloc() using the generalised model of a
heap, placing the external heap in the largest usable region obtained
from the system memory map.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Size-tracked pointers allocated via umalloc() have historically been
aligned to a page boundary, as have the edges of the hidden memory
region covering the external heap.
Allow the block and size-tracked pointer alignments to be specified as
heap configuration parameters.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Create a generic model of a heap as a list of free blocks with
optional methods for growing and shrinking the heap.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
All memory map users have been updated to use the new system memory
map API. Remove get_memmap() and its associated definitions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There are several places where get_memmap() is called solely to
produce debug output. Replace these with calls to memmap_dump_all()
(which will be a no-op unless debugging is enabled).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use the concept of an in-use memory region defined as part of the
system memory map API to describe the umalloc() heap.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide an implementation of the system memory map API based on the
assorted BIOS INT 15 calls, and a temporary implementation of the
legacy get_memmap() function using the new API.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide an implementation of the system memory map API based on the
system device tree, excluding any memory outside the size of the
accessible physical address space and defining an in-use region to
cover the relocated copy of iPXE and the system device tree.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Define a generic system memory map API, based on the abstraction
created for parsing the FDT memory map and adding a concept of hidden
in-use memory regions as required to support patching the BIOS INT 15
memory map.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The size of accessible physical address space will be required for the
runtime memory map, not just at relocation time. Make this size an
additional parameter to fdt_register() (matching the prototype for
fdt_relocate()), and record the value for future reference.
Note that we cannot simply store the limit in fdt_relocate() since it
is called before .data is writable and before .bss is zeroed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Create namespace for an architecture-independent memmap.c by renaming
the BIOS-specific memmap.c to int15.c.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Replace malloc_phys with dma_alloc, free_phys with dma_free, alloc_iob
with alloc_rx_iob, free_iob with free_rx_iob, virt_to_bus with dma or
iob_dma. Replace dma_addr_t with physaddr_t.
Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>
Return the proper error codes in bnxt_init_one, to indicate the
correct return status upon completion. Failure paths could
incorrectly indicate a success. Correct assertion condition to check
for non-NULL pointer.
Signed-off-by: Joseph Wong <joseph.wong@broadcom.com>
Coverity reports a spurious potential null pointer dereference in
cms_decrypt(), since the null pointer check takes place after the
pointer has already been dereferenced. The pointer can never be null,
since it is initialised to point to cipher_null at the point that the
containing structure is allocated.
Remove the redundant null pointer check, and for symmetry ensure that
the digest and public-key algorithm pointers are similarly initialised
at the point of allocation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
QEMU's -pflash option requires an image that has been padded to the
exact expected size (32MB for all of the supported RISC-V virtual
machines).
Add a .pf32 build target which is simply the equivalent .sbi target
padded to 32MB in size, to simplify testing.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If paging is not supported, then we will attempt to apply dynamic
relocations to fix up the runtime addresses. If the image is
currently executing directly from flash memory, this can result in
effectively sending an undefined sequence of commands to the flash
device, which can cause unwanted side effects.
Perform an explicit writability test before applying relocations,
using a write value chosen to be safe for at least any devices
conforming to the JEDEC Common Flash Interface (CFI01).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We do not currently describe the temporary page table or the temporary
stack as areas to be avoided during relocation of the iPXE image to a
new physical address.
Perform the copy of the iPXE image and zeroing of the .bss within
libprefix.S, after we have no further use for the temporary page table
or the temporary initial stack. Perform the copy and registration of
the system device tree in C code after relocation is complete and the
new stack (within .bss) has been set up.
This provides a clean separation of responsibilities between the
RISC-V libprefix.S and the architecture-independent fdtmem.c. The
prefix is responsible only for relocating iPXE to the new physical
address returned from fdtmem_relocate(), and doesn't need to know or
care where fdtmem.c is planning to place the copy of the device tree.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On x86 BIOS, it has been useful to be able to build iPXE to resemble a
Linux kernel, so that it can be loaded by programs such as syslinux
which already know how to handle Linux kernel binaries.
Add an equivalent .lkrn build target for RISC-V SBI, allowing for
build targets such as:
  make bin-riscv64/ipxe.lkrn
  make bin-riscv64/cgem.lkrn
The Linux kernel header format allows us to specify a required length
(including uninitialised-data portions) and defines that the image
will be loaded at a fixed offset from the start of RAM. We can
therefore use known-safe areas of memory (within our own .bss) for the
initial temporary page table and stack.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On startup, we may be running from read-only memory. We need to parse
the devicetree to obtain the system memory map, and identify a safe
location to which we can copy our own binary image along with a
stashed copy of the devicetree, and then transfer execution to this
new location.
Parsing the system memory map realistically requires running C code.
This in turn requires a small temporary stack, and some way to ensure
that symbol references are valid.
We first attempt to enable paging, to make the runtime virtual
addresses equal to the link-time virtual addresses. If this fails,
then we attempt to apply the compressed relocation records.
Assuming that one of these has worked (i.e. that either the CPU
supports paging or that our image started execution in writable
memory), then we call fdtmem_relocate() to parse the system memory map
to find a suitable relocation target address.
After the copy we disable paging, jump to the relocated copy,
re-enable paging, and reapply relocation records (if needed). At this
point, we have a full runtime environment, and can transfer control to
normal C code.
Provide this functionality as part of libprefix.S, since it is likely
to be shared by multiple prefixes.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Always construct the page tables based on the link-time address values
even if relocations have already been applied, on the assumption that
relocations will be reapplied after paging has been enabled.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The address of the compressed relocation records is currently
calculated implicitly relative to the program counter. This requires
the relocation records to be copied as part of relocation to a new
physical address, so that they can be reapplied (if needed) after
copying iPXE to the new physical address.
Since the relocation destination will never overlap the original iPXE
image, and since the relocation records will not be needed further
after completing relocation, we can avoid the need to copy the records
by passing in a pointer to the relocation records present in the
original iPXE image.
Pass the compressed relocation record address as an explicit parameter
to apply_relocs(), rather than being implicit in the program counter.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Relocation requires knowledge of the size of the accessible physical
address space, which for 64-bit CPUs will vary according to the paging
level supported by the processor.
Update enable_paging_64() and enable_paging_32() to calculate and
return the size of the accessible physical address space.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add code to parse the devicetree memory nodes, memory reservations
block, and reserved memory nodes to construct an ordered and
non-overlapping description of the system memory map, and use this to
identify a suitable address to which iPXE may be relocated at runtime.
We choose to place iPXE on a superpage boundary (as required by the
paging code), and to use the highest available address within
accessible memory. This mirrors the approach taken for x86 BIOS
builds, where we have long assumed that any image format that we might
need to support may require specific fixed addresses towards the
bottom of the memory map, but is very unlikely to require specific
fixed addresses towards the top of the memory map (since those
addresses may not exist, depending on the amount of installed RAM).
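
The selection policy described above can be sketched as follows. This
is a simplified illustration only: the structure, the function name,
and the assumption that the usable regions have already been reduced
to a plain array are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A usable memory region (hypothetical representation) */
struct mem_region {
	uint64_t start;		/* first byte */
	uint64_t end;		/* one past the last byte */
};

/* Find the highest aligned address at which "len" bytes fit within a
 * usable region lying below "limit".  Returns zero if nothing fits.
 */
static uint64_t pick_reloc ( const struct mem_region *regions, size_t count,
			     uint64_t len, uint64_t align, uint64_t limit ) {
	uint64_t best = 0;
	uint64_t start;
	uint64_t end;
	size_t i;

	/* Alignment (e.g. the superpage size) must be a power of two */
	assert ( ( align != 0 ) && ( ( align & ( align - 1 ) ) == 0 ) );

	for ( i = 0 ; i < count ; i++ ) {
		/* Truncate the region to the accessible address space */
		end = regions[i].end;
		if ( end > limit )
			end = limit;
		if ( ( end < regions[i].start ) ||
		     ( ( end - regions[i].start ) < len ) )
			continue;
		/* Place the image as high as possible, rounding down */
		start = ( ( end - len ) & ~( align - 1 ) );
		if ( ( start >= regions[i].start ) && ( start > best ) )
			best = start;
	}
	return best;
}
```

For example, with 128MB of RAM at 0x80000000, a 3MB image, and 2MB
superpages, this sketch selects 0x87c00000.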
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Ensure that the prefix_virt dynamic relocation ends up on a suitably
aligned boundary for a compressed relocation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iPXE does not make use of any thread-local storage. Use the otherwise
unused thread pointer register ("tp") to hold the current value of
the virtual address offset, rather than using a global variable.
This ensures that virt_offset can be made valid even during very early
initialisation (when iPXE may be executing directly from read-only
memory and so cannot update a global variable).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The "reg" property is also used by non-device nodes, such as the nodes
describing the system memory map.
Provide generalised functionality for parsing the "#address-cells",
"#size-cells", and "reg" properties.
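
A minimal sketch of what such parsing involves is shown below, using
hypothetical helper names and operating on the raw property bytes.
Devicetree cells are 32-bit big-endian values, and this sketch
supports at most two cells per address or size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian value made up of "cells" 32-bit cells (at most two) */
static uint64_t cells_to_u64 ( const uint8_t *data, unsigned int cells ) {
	uint64_t value = 0;
	unsigned int i;

	assert ( cells <= 2 );
	for ( i = 0 ; i < ( cells * 4 ) ; i++ )
		value = ( ( value << 8 ) | data[i] );
	return value;
}

/* Extract one (address, size) pair from a raw "reg" property */
static int reg_entry ( const uint8_t *reg, size_t len,
		       unsigned int address_cells, unsigned int size_cells,
		       unsigned int index, uint64_t *address,
		       uint64_t *size ) {
	size_t stride;
	size_t offset;

	/* This sketch supports only one or two address cells */
	if ( ( address_cells == 0 ) || ( address_cells > 2 ) ||
	     ( size_cells > 2 ) )
		return -1;

	/* Check that the requested entry lies within the property */
	stride = ( ( address_cells + size_cells ) * 4 );
	offset = ( index * stride );
	if ( ( offset > len ) || ( ( len - offset ) < stride ) )
		return -1;

	*address = cells_to_u64 ( ( reg + offset ), address_cells );
	*size = cells_to_u64 ( ( reg + offset + ( address_cells * 4 ) ),
			       size_cells );
	return 0;
}
```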
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The pattern of "load address to register" followed by "load value from
address in register" generally results in three instructions: two to
load the address and one to load the value.
This can be reduced to two instructions by allowing the assembler to
incorporate the low bits of the address within the load (or store)
instruction itself. In the case of a store, this requires specifying
a second register that can be temporarily used to hold the high bits
of the address. (In the case of a load, the destination register is
reused for this purpose.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In a position-dependent executable, where all addresses are fixed
at link time, we can use the standard technique as documented by
GNU ld to get the value of an absolute symbol, e.g.:
  extern char _my_symbol[];
  printf ( "Absolute symbol value is %x\n", ( ( int ) _my_symbol ) );
This technique may not work in a position-independent executable.
When dynamic relocations are applied, the runtime addresses will no
longer be equal to the link-time addresses. If the code to obtain the
address of _my_symbol uses PC-relative addressing, then it will
calculate the runtime "address" of the absolute symbol, which will no
longer be equal to the link-time "address" (i.e. the correct value)
of the absolute symbol.
Define macros ABS_SYMBOL(), ABS_VALUE_INIT(), and ABS_VALUE() that
provide access to the correct values of absolute symbols even in
position-independent code, and use these macros wherever absolute
symbols are accessed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
During early initialisation on some platforms, the .data and .bss
sections may not yet be writable.
Display the assertion message before attempting to increment the
assertion failure counter, since writing to the assertion counter may
trigger a CPU exception that ends up resetting the system.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Once paging has been enabled, there is no direct way to determine the
virtual address offset without external knowledge. (The paging mode,
if needed, can be read directly from the SATP CSR.)
Change the return value from enable_paging() to provide the virtual
address offset.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If the virtual address offset is precisely one page (i.e. each virtual
address maps to a physical address one page higher), and if the 32-bit
transition code happens to end up at the end of a page (which would
require an unrealistic 2MB of content in .prefix), then it would be
possible for the program counter to cross into the portion of the
virtual address space still borrowed for use as the temporary physical
map.
Avoid this remote possibility by moving the restoration of the
temporarily modified PTE within the transition code block (which is
guaranteed to remain within a single page since it is aligned on its
own size).
This unfortunately requires increasing the alignment of the transition
code (and hence the maximum number of NOPs inserted). The assembler
syntax theoretically allows us to avoid inserting any NOPs via a
directive such as:
  .balign PAGE_SIZE, , enable_paging_32_max_len
(i.e. relying on the fact that if the transition code is already
sufficiently far away from the end of a page, then no padding needs to
be inserted). However, alignment on RISC-V is implemented using the
R_RISCV_ALIGN relaxing relocation, which doesn't encode any concept of
a maximum padding length, and so the maximum padding length value is
effectively ignored.
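
The guarantee relied upon above (a block aligned on its own
power-of-two size cannot straddle a page boundary, provided that it is
no larger than a page) can be checked with a small sketch. The names
and the 4kB page size are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Check whether the block [start, start + len) touches more than one page */
static int crosses_page ( uintptr_t start, uintptr_t len ) {

	assert ( len > 0 );
	return ( ( start / SKETCH_PAGE_SIZE ) !=
		 ( ( start + len - 1 ) / SKETCH_PAGE_SIZE ) );
}

/* Check that a block is aligned on its own (power-of-two) size */
static int self_aligned ( uintptr_t start, uintptr_t len ) {

	return ( ( ( len & ( len - 1 ) ) == 0 ) &&
		 ( ( start & ( len - 1 ) ) == 0 ) );
}
```

A 64-byte block starting 32 bytes before a page boundary crosses that
boundary, but such a start address is not 64-byte aligned and so
could not arise for a self-aligned block.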
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The virtual offset memory model used for i386-pcbios and x86_64-pcbios
can be generalised to also cover riscv32-sbi and riscv64-sbi. In both
architectures, the 32-bit builds will use a circular map of the 32-bit
address space, and the 64-bit builds will use an identity map for the
relevant portion of the physical address space, with iPXE itself
placed in the negative (kernel) address space.
Generalise and document the virt_offset mechanism, and set it as the
default for both PCBIOS and SBI platforms.
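
For the 32-bit case, the circular map can be illustrated with plain
modular arithmetic. This is a sketch under the assumption that the
offset is defined as (physical - virtual); the function names and the
addresses used in the example are illustrative only.

```c
#include <stdint.h>

/* Convert a virtual address to a physical address (32-bit circular map).
 * Unsigned arithmetic wraps modulo 2^32, so every virtual address
 * maps to exactly one physical address and vice versa.
 */
static uint32_t sketch_virt_to_phys ( uint32_t virt, uint32_t virt_offset ) {
	return ( virt + virt_offset );
}

/* Convert a physical address to a virtual address (32-bit circular map) */
static uint32_t sketch_phys_to_virt ( uint32_t phys, uint32_t virt_offset ) {
	return ( phys - virt_offset );
}
```

For example, an image linked at 0xeb000000 and loaded at physical
address 0x80200000 would use an offset of 0x95200000, and physical
address zero would then appear at virtual address 0x6ae00000.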
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Inline assembly using PHYS_CODE() or REAL_CODE() must use the "R"
constraint rather than the "r" constraint to ensure that the compiler
chooses registers that will be valid for the 32-bit or 16-bit assembly
code fragment.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add millicode routines to print hexadecimal values (with any number of
digits), and macros to print register contents or symbol addresses.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RISC-V has a millicode calling convention that allows for the use of
an alternative link register x5/t0. With sufficient care, this allows
for two levels of subroutine call even when no stack is available.
Provide both standard and millicode entry points for print_message(),
and use the millicode entry point to allow for printing debug messages
from libprefix.S itself.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Create a prefix library function print_message() to print text to the
SBI debug console. Use the "write byte" SBI call (rather than "write
string") so that the function remains usable even after enabling
paging.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The GNU assembler does not seem to automatically assume alignment to
an instruction boundary for sections containing assembled code.
Place the prefix debug strings (if present) in .rodata rather than in
.prefix, to avoid potentially creating misaligned code sections.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use compressed relocation records instead of raw Elf_Rela records.
This saves around 15% of the total binary size for the all-drivers
image bin-riscv64/ipxe.sbi.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Even though we build with -mno-plt, redundant .got and .got.plt
sections are still generated.
Include these redundant sections within .data (which has identical
section attributes) to simplify the section list.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The ELF hash table is generated when building a position-independent
executable even though it is not required (since we have no dynamic
linker).
Explicitly discard these unneeded sections.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Define a new "ZREL" compressor information block, describing a block
of Elf_Rel or Elf_Rela runtime relocations to be converted to an
iPXE-specific compressed relocation format.
The compressed relocation format is based loosely on the Elf_Relr
bitmap+offset format, with some optimisations for use in iPXE. In
particular:
- a relative "skip" value is used instead of an absolute offset
- the width of the skip value is reduced to 19 bits (when present)
- an explicit skip value of zero is used to terminate the list
- unaligned relocations are prohibited
The layout of bits within the compressed relocation record is also
adjusted to make assembly code implementations simpler: the skip flag
bit is placed in the MSB so that it can be tested using "bltz" or
similar instructions, and the skip value is placed above the
relocation flag bits so that a typical shifting implementation will
naturally end up with a zero value in its accumulator if and only if
the record was a terminator.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Parsing ELF data is simpler if we don't have to build a single binary
to handle both 32-bit and 64-bit ELF formats.
Allow for separate 32-bit and 64-bit binaries built from util/zbin.c
(as is already done for util/elf2efi.c).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add code to construct a 32-bit page table to map the whole of the
32-bit address space with a fixed offset selected to map iPXE itself
at its link-time address, and to return with paging enabled and the
program counter updated to a virtual address.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Paging provides an alternative to using relocations: instead of
applying relocation fixups to the runtime addresses, we can set up
virtual addressing so that the runtime addresses match the link-time
addresses.
This opens up the possibility of running portions of iPXE directly
from read-only memory (such as a memory-mapped flash device), subject
to the caveats that .data is not yet writable and .bss is not yet
zeroed. This should allow us to run enough code to parse the memory
map from the FDT, identify a suitable RAM block, and physically
relocate ourselves there.
Add code to construct a 64-bit page table (in a single 4kB buffer) to
identity-map as much of the physical address space as possible, to map
iPXE itself at its link-time address, and to return with paging
enabled and the program counter updated to a virtual address. We use
the highest paging level supported by the CPU, to maximise the amount
of the physical address space covered by the identity map.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Using paging (rather than relocation records) will be easier on 64-bit
RISC-V if we place iPXE within the negative (kernel) virtual address
space.
Allow the link-time address to be non-zero and to vary between 32-bit
and 64-bit builds. Choose addresses that are expected to be amenable
to the use of paging.
There is no particular need to use a non-zero address in the 32-bit
builds, but doing so allows us to validate that the relocation code is
handling this case correctly.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Split out the runtime relocation logic from sbiprefix.S to a new
library libprefix.S.
Since this logically decouples the process of runtime relocation from
the _sbi_start symbol (currently used to determine the base address
for applying relocations), provide an alternative mechanism for the
relocator to determine the base address.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove the last remaining traces of the concept of a user pointer,
leaving iPXE with a simpler and cleaner memory model that implicitly
assumes that all memory locations can be reached through pointer
dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The uaccess.h header is no longer required for any code that touches
external ("user") memory, since such memory accesses are now performed
through pointer dereferences. Reduce the number of files including
this header.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Almost all image consumers do not need to modify the content of the
image. Now that the image data is a pointer type (rather than the
opaque userptr_t type), we can rely on the compiler to enforce this at
build time.
Change the .data field to be a const pointer, so that the compiler can
verify that image consumers do not modify the image content. Provide
a transparent .rwdata field for consumers who have a legitimate (and
now explicit) reason to modify the image content.
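
One way to provide two views of the same pointer is an anonymous
union, as in the sketch below. The structure is heavily simplified
and the two helper functions are hypothetical; only the field names
.data and .rwdata are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified image structure (illustrative only) */
struct image {
	union {
		const void *data;	/* read-only view for most consumers */
		void *rwdata;		/* writable view for explicit modifiers */
	};
	size_t len;
};

/* A read-only consumer: the compiler rejects any write via ->data */
static unsigned int image_sum ( const struct image *image ) {
	const uint8_t *bytes = image->data;
	unsigned int sum = 0;
	size_t i;

	for ( i = 0 ; i < image->len ; i++ )
		sum += bytes[i];
	return sum;
}

/* A consumer with an explicit reason to modify the image content */
static void image_xor ( struct image *image, uint8_t key ) {
	uint8_t *bytes = image->rwdata;
	size_t i;

	assert ( bytes != NULL );
	for ( i = 0 ; i < image->len ; i++ )
		bytes[i] ^= key;
}
```

With this arrangement, an accidental write through .data fails at
build time, while a deliberate write is visible in the source as a use
of .rwdata.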
We do not attempt to impose any runtime restriction on checking
whether or not an image is writable. The only existing instances of
genuinely read-only images are the various unit test images, and it is
acceptable for defective test cases to result in a segfault rather
than a runtime error.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Not all images are allocated via alloc_image(). For example: embedded
images, the static images created to hold a runtime command line, and
the images used by unit tests are all static structures.
Using image_set_cmdline() (via e.g. the "imgargs" command) to set the
command-line arguments of a static image will succeed but will leak
memory, since nothing will ever free the allocated command line.
There are no code paths that can lead to calling image_set_len() on a
static image, but there is no safety check against future code paths
attempting this.
Define a flag IMAGE_STATIC to mark an image as statically allocated,
generalise free_image() to also handle freeing dynamically allocated
portions of static images (such as the command line), and expose
free_image() for use by static images.
Define a related flag IMAGE_STATIC_NAME to mark the name as statically
allocated. Allow a statically allocated name to be replaced with a
dynamically allocated name since this is a potentially valid use case
(e.g. if "imgdecrypt --name <name>" is used on an embedded image).
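
The intended behaviour can be sketched as follows. The flag names are
taken from the description above, but the flag values, the reduced
structure, and the function bodies are assumptions made for this
example and are not the iPXE implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define IMAGE_STATIC		0x01U	/* hypothetical value */
#define IMAGE_STATIC_NAME	0x02U	/* hypothetical value */

/* Reduced image structure (illustrative only) */
struct image {
	unsigned int flags;
	char *name;
	char *cmdline;
};

/* Replace the image name, allowing for a statically allocated old name */
static int image_set_name ( struct image *image, const char *name ) {
	size_t len = ( strlen ( name ) + 1 );
	char *copy = malloc ( len );

	if ( ! copy )
		return -1;
	memcpy ( copy, name, len );
	if ( ! ( image->flags & IMAGE_STATIC_NAME ) )
		free ( image->name );
	image->name = copy;
	image->flags &= ~IMAGE_STATIC_NAME;
	return 0;
}

/* Free dynamically allocated portions, and the image itself if dynamic */
static void free_image ( struct image *image ) {

	assert ( image != NULL );
	free ( image->cmdline );
	image->cmdline = NULL;
	if ( ! ( image->flags & IMAGE_STATIC_NAME ) )
		free ( image->name );
	image->name = NULL;
	if ( ! ( image->flags & IMAGE_STATIC ) )
		free ( image );
}
```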
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Decrypting a CMS-encrypted image will overwrite the existing image
data in place, and using an encrypted embedded image is a valid use
case.
Move embedded images from .rodata to .data to reflect the fact that
they are intended to be writable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If an embedded script uses "chain --replace", the embedded image will
retain a reference to the replacement image in perpetuity.
Fix by clearing any recorded replacement image immediately in
image_exec(), instead of relying upon image_free() to drop the
reference.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The BOFM tests are not part of the standard unit test suite, since
they are designed to allow for exercising real BOFM driver code
outside of the context of a real IBM blade server.
Allow for the BOFM tests to be run without a real BOFM driver, by
providing a dummy driver for the specified PCI test device.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The peerdist_msg_blk() macro seems to have been introduced in the
original commit that added pccrr.h, but this macro was never used by
the version of the code present in that commit.
Remove this unused macro and the corresponding nonexistent external
function declaration.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Since all data transfer buffer contents are now accessible via direct
pointer dereferences, remove the unnecessary abstractions for read and
write operations and create two new data transfer buffer types: a
fixed-size buffer, and a void buffer that records its size but can
never receive non-zero lengths of data. These replace the custom data
buffer types currently implemented for EFI PXE TFTP downloads and for
block device translations.
A new operation xferbuf_detach() is required to take ownership of the
data accumulated in the data transfer buffer, since we no longer rely
on the existence of an independently owned external data pointer for
data transfer buffers allocated via umalloc().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify cmdline_init() by assuming that the externally provided
command line is directly accessible via pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify bzImage parsing by assuming that the various headers are
directly accessible via pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit ef03849 ("[uaccess] Remove redundant userptr_add() and
userptr_diff()") exposed a signedness bug in the comparison of initrd
locations, since the expression (initrd->data - current) was
effectively no longer coerced to a signed type.
In particular, the common case will be that the top of the initrd
region is the start of the iPXE .textdata region, which has virtual
address zero. This causes initrd->data to compare as being above the
top of the initrd region for all images, when this bug would
previously have been limited to affecting only initrds placed 2GB or
more below the start of .textdata.
Fix by using physical addresses for all comparisons on initrd
locations.
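
The general class of bug can be illustrated as below. This is not the
code from initrd.c: it is a minimal sketch with hypothetical function
names, showing why the sign of an unsigned difference cannot be used
to order two addresses.

```c
#include <stdint.h>

/* Buggy: an unsigned difference is never negative, so this reports
 * "above" whenever the two addresses differ.
 */
static int is_above_buggy ( uintptr_t a, uintptr_t b ) {
	return ( ( a - b ) > 0 );
}

/* Fixed: compare the (unsigned) physical addresses directly */
static int is_above ( uintptr_t phys_a, uintptr_t phys_b ) {
	return ( phys_a > phys_b );
}
```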
Reported-by: Sven Dreyer <sven@dreyer-net.de>
Reported-by: Harald Jensås <hjensas@redhat.com>
Reported-by: Jan ONDREJ (SAL) <ondrejj@salstar.sk>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the ability to reboot to the firmware setup menu (if supported) by
setting the relevant value in the OsIndications variable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for the possibility of additional reboot types by extending the
reboot() function to use a flags bitmask rather than a single flag.
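
As a sketch of why a bitmask extends more easily than a single flag,
consider the following. The flag names, their values, and the helper
function are hypothetical and exist only for this illustration.

```c
#define SKETCH_REBOOT_WARM	0x01	/* hypothetical: skip full reset */
#define SKETCH_REBOOT_SETUP	0x02	/* hypothetical: enter firmware setup */

/* Describe the action requested by a combination of reboot flags */
static const char * reboot_describe ( int flags ) {

	if ( flags & SKETCH_REBOOT_SETUP )
		return "reboot to firmware setup";
	if ( flags & SKETCH_REBOOT_WARM )
		return "warm reboot";
	return "cold reboot";
}
```

New reboot types can be added as further bits without changing the
function prototype or any existing callers.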
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify Multiboot and ELF image parsing by assuming that the
Multiboot and ELF headers are directly accessible via pointer
dereferences, and add some missing header validations.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
GCC 15 generates a warning when a string initializer is too large to
allow for a trailing NUL terminator byte. This type of initializer is
fairly common in signature strings such as ACPI table identifiers.
Fix by disabling the warning.
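
For illustration, the pattern in question looks like the sketch
below (the names are hypothetical). The initializer is valid C, and
GCC 15 documents the relevant warning as
-Wunterminated-string-initialization.

```c
#include <string.h>

/* Exactly four bytes: there is no room for a trailing NUL terminator */
static const char signature[4] = "RSDT";

/* Compare a table's first four bytes against the signature */
static int has_signature ( const void *table ) {
	return ( memcmp ( table, signature, sizeof ( signature ) ) == 0 );
}
```

The array is only ever compared with memcmp() over its exact size, so
the missing terminator is harmless.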
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The legacy NIC drivers do not consistently take a second parameter in
their disable function. We currently use an unsafe function wrapper
that declares no parameters, and rely on the ABI allowing a second
parameter to be silently ignored if not expected by the callee. As of
GCC 15, this hack results in an incompatible pointer type warning.
Fix by removing the hack, and instead updating all relevant legacy NIC
drivers to take an unused second parameter in their disable function.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
GCC 15 defaults to C23, which reserves bool, true, and false as
keywords. Avoid using these as parameter or variable names.
Modified-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Scrolling currently involves redrawing every character cell, which can
be frustratingly slow on large framebuffer consoles. Accelerate this
operation by skipping the redraw for any unchanged character cells.
In the common case that large areas of the screen contain whitespace,
this optimises away the vast majority of the redrawing operations.
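
The optimisation can be sketched as follows, using a hypothetical
character cell structure and an optional draw callback. The function
returns the number of cells actually redrawn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A character cell (illustrative only) */
struct cell {
	char character;
	uint8_t attribute;
};

/* Callback used to redraw a single cell (may be NULL in this sketch) */
typedef void ( * draw_fn ) ( unsigned int row, unsigned int col,
			     const struct cell *cell );

/* Scroll up by one row, redrawing only the cells whose content changes */
static unsigned int scroll_up ( struct cell *cells, unsigned int rows,
				unsigned int cols, draw_fn draw ) {
	struct cell next;
	struct cell *cell;
	unsigned int redrawn = 0;
	unsigned int row;
	unsigned int col;

	assert ( rows > 0 );
	for ( row = 0 ; row < rows ; row++ ) {
		for ( col = 0 ; col < cols ; col++ ) {
			cell = &cells[ ( row * cols ) + col ];
			/* New content comes from the row below, or is
			 * blank for the final row.
			 */
			if ( ( row + 1 ) < rows ) {
				next = cells[ ( ( row + 1 ) * cols ) + col ];
			} else {
				next.character = ' ';
				next.attribute = 0;
			}
			/* Skip the redraw if nothing has changed */
			if ( ( cell->character == next.character ) &&
			     ( cell->attribute == next.attribute ) )
				continue;
			*cell = next;
			if ( draw )
				draw ( row, col, cell );
			redrawn++;
		}
	}
	return redrawn;
}
```

On a mostly blank 3x4 screen containing a single character, scrolling
redraws two cells (the character's old and new positions) rather than
all twelve.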
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the framebuffer console drivers by assuming that the raw
framebuffer, character cell array, background picture, and glyph data
are all directly accessible via pointer dereferences.
In particular, this avoids the need to copy each glyph during drawing:
the VESA framebuffer driver can simply return a pointer to the glyph
data stored in the video ROM.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the PXE file API implementation by assuming that all string
buffers are directly accessible via pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the PXE API call dispatcher code by assuming that the PXE
parameter block is accessible via a direct pointer dereference. This
avoids the need for the API call dispatcher to know the size of the
parameter block.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the implementation of the "digest" command by assuming that
the entire image data can be passed directly to digest_update().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the block device code by assuming that all read/write buffers
are directly accessible via pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify microcode image parsing by assuming that all image content is
directly accessible via pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the microcode update mechanism by assuming that status
reports are accessible via direct pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Use standard void pointers for umalloc(), urealloc(), and ufree(),
with the "u" prefix retained to indicate that these allocations are
made from external ("user") memory rather than from the internal heap.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the SMBIOS structure parsing code by assuming that all
structure content is fully accessible via pointer dereferences.
In particular, this allows the convoluted find_smbios_structure() and
read_smbios_structure() to be combined into a single function
smbios_structure() that just returns a direct pointer to the SMBIOS
structure, with smbios_string() similarly now returning a direct
pointer to the relevant string.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the ACPI table parsing code by assuming that all table
content is fully accessible via pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the deflate, zlib, and gzip decompression code by assuming
that all content is fully accessible via pointer dereferences.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the CMS code by assuming that all content is fully accessible
via pointer dereferences. This avoids the need to use fragment loops
for calculating digests and decrypting (or reencrypting) data.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the ASN.1 code by assuming that all objects are fully
accessible via pointer dereferences. This allows the concept of
"additional data beyond the end of the cursor" to be removed, and
simplifies parsing of all ASN.1 image formats.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove the intermediate concept of a user pointer from real address
conversion, leaving real_to_virt() as the directly implemented
function.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Remove the intermediate concept of a user pointer from physical
address conversions, leaving virt_to_phys() and phys_to_virt() as the
directly implemented functions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The user_to_virt() function is now a straightforward wrapper around
addition, with the addend almost invariably being zero.
Remove this redundant wrapper.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The memcpy_user(), memmove_user(), memcmp_user(), memset_user(), and
strlen_user() functions are now just straightforward wrappers around
the corresponding standard library functions.
Remove these redundant wrappers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The userptr_add() and userptr_diff() functions are now just
straightforward wrappers around addition and subtraction.
Remove these redundant wrappers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The original motivation for the userptr_t type was to be able to
support a pure 16-bit real-mode memory model in which a segment:offset
value could be encoded as an unsigned long, with corresponding
copy_from_user() and copy_to_user() functions used to perform
real-mode segmented memory accesses.
Since this memory model was first created almost twenty years ago, no
serious effort has been made to support a pure 16-bit mode of
operation for iPXE. The constraints imposed by the memory model are
becoming increasingly cumbersome to work within: for example, the
parsing of devicetree structures is hugely simplified by being able to
use and return direct pointers to the names and property values. The
devicetree code therefore relies upon virt_to_user(), which is
nominally illegal under the userptr_t memory model.
Drop support for the concept of a memory location that cannot be
reached through a straightforward pointer dereference, by redefining
userptr_t to be a simple pointer type.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Clarify the intended usage of userptr_sub() by renaming it to
userptr_diff() (to avoid confusion with userptr_add()), and fix the
existing call sites that erroneously use userptr_sub() to subtract an
offset from a userptr_t value.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
For platforms with no real-time clock (such as RISC-V SBI) we use the
null time source, which currently just returns a constant zero.
Switch to using currticks() to provide a clock that does not represent
the real current time, but does at least advance at approximately the
correct rate. In conjunction with the "ntp" command, this allows
these platforms to use time-dependent features such as X.509
certificate verification for HTTPS connections.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a basic driver for the Cadence GEM network interface as emulated
by QEMU when using the RISC-V "sifive_u" machine type.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The UEFI model for wireless network configuration is somewhat
underdefined. At the time of writing, the EDK2 "UEFI WiFi Connection
Manager" driver provides only one way to configure wireless network
credentials, which is to enter them interactively via an HII form.
Credentials are not stored (or exposed via any protocol interface),
and so any temporary disconnection from the wireless network will
inevitably leave the interface in an unusable state that cannot be
recovered without user intervention.
Experimentation shows that at least some wireless network drivers
(observed with an HP Elitebook 840 G10) will disconnect from the
wireless network when the SNP Shutdown() method is called, or if the
device is not polled sufficiently frequently to maintain its
association to the network. We therefore inhibit calls to Shutdown()
and Stop() for any such SNP protocol interfaces, and mark our network
device as insomniac so that it will be polled even when closed.
Note that we need to inhibit not only our own calls to Shutdown() and
Stop(), but also those that will be attempted by MnpDxe when we
disconnect it from the SNP handle. We do this by patching the
installed SNP protocol interface structure to modify the Shutdown()
and Stop() method pointers, which is ugly but unavoidable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some network devices (observed with the SNP interface to the wireless
network card on an HP Elitebook 840 G10) will stop working if they are
left for too long without being polled.
Add the concept of an insomniac network device, that must continue to
be polled even when closed.
Note that drivers are already permitted to call netdev_rx() et al even
when closed: this will already be happening for USB devices since
polling operates at the level of the whole USB bus, rather than at the
level of individual USB devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for greater control over the process used to disconnect existing
drivers from a device handle, by converting the "exclude" field from a
simple protocol GUID to a per-driver method.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Devicetree devices encode register address ranges within the "reg"
property, with the number of cells used for addresses and for sizes
determined by the #address-cells and #size-cells properties of the
immediate parent device.
Record the number of address and size cells for each device, and
provide a dt_ioremap() function to allow drivers to map a specified
range without having to directly handle the "reg" property.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add fdt_cells() to read scalar values encoded within a cell array,
reimplement fdt_u64() as a wrapper around this, and add fdt_u32() for
completeness.
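As an illustration only, the idea of reading a scalar from a cell array can be sketched as below. This is not the iPXE implementation: cells_read(), prop_u64(), and prop_u32() are hypothetical names, and the sketch assumes only that cells are big-endian 32-bit values and that a scalar occupies one or two cells.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one big-endian 32-bit cell from raw property bytes */
static uint32_t cell_be32 ( const uint8_t *raw ) {
	return ( ( ( uint32_t ) raw[0] << 24 ) | ( ( uint32_t ) raw[1] << 16 ) |
		 ( ( uint32_t ) raw[2] << 8 ) | ( ( uint32_t ) raw[3] ) );
}

/* Read a scalar spanning "count" cells, starting at cell "index"
 * (hypothetical helper; returns 0 on success, -1 on error)
 */
static int cells_read ( const uint8_t *raw, size_t len, unsigned int index,
			unsigned int count, uint64_t *value ) {
	size_t offset = ( ( size_t ) index * 4 );
	unsigned int i;

	assert ( value != NULL );

	/* A 64-bit result can hold at most two cells */
	if ( count > 2 )
		return -1;

	/* Check that the requested cells lie within the property */
	if ( ( offset > len ) ||
	     ( ( len - offset ) < ( ( size_t ) count * 4 ) ) )
		return -1;

	/* Accumulate cells, most significant first */
	*value = 0;
	for ( i = 0 ; i < count ; i++ ) {
		*value <<= 32;
		*value |= cell_be32 ( raw + offset + ( 4 * i ) );
	}
	return 0;
}

/* 64-bit reader as a wrapper: the whole property is the value */
static int prop_u64 ( const uint8_t *raw, size_t len, uint64_t *value ) {
	return cells_read ( raw, len, 0, ( unsigned int ) ( len / 4 ),
			    value );
}

/* 32-bit reader: as above, but reject values that do not fit */
static int prop_u32 ( const uint8_t *raw, size_t len, uint32_t *value ) {
	uint64_t wide;

	if ( prop_u64 ( raw, len, &wide ) != 0 )
		return -1;
	if ( wide > UINT32_MAX )
		return -1;
	*value = ( uint32_t ) wide;
	return 0;
}
```

In this sketch the 64-bit and 32-bit readers share a single bounds check and a single byte-order conversion, which is the main attraction of layering them over one cell-array reader.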
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently disable all external trust sources (such as the UEFI
TlsCaCertificate variable) if an explicit TRUST=... parameter is
provided on the build command line.
Define an explicit TRUST_EXT build parameter that can be used to
explicitly disable external trust sources even if no TRUST=...
parameter is provided, or to explicitly enable external trust sources
even if an explicit TRUST=... parameter is provided. For example:
  # Default trusted root certificate, disable external sources
  make TRUST_EXT=0
  # Explicit trusted root certificate, enable external sources
  make TRUST=custom.crt TRUST_EXT=1
If no TRUST_EXT parameter is specified, then continue to default to
disabling external trust sources if an explicit TRUST=... parameter is
provided, to maintain backwards compatibility with existing build
command lines.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a basic model for devices instantiated by parsing the system
flattened device tree, with drivers matched via the "compatible"
property for any non-root node.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Refactor device tree traversal to operate on the basis of describing
the token at a given offset, with no separate notion of a device tree
cursor.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Using fdt_path() to find the root node "/" currently fails, since it
will attempt to find a child node with the empty name "" within the
root node.
Fix by changing fdt_path() to ignore any trailing slashes in a device
tree path.
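A minimal sketch of a path walker that tolerates trailing slashes is shown below. It is not the actual fdt_path() code: path_component() and path_depth() are hypothetical helpers, and this sketch also happens to collapse repeated separators, which is an assumption beyond what is described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Extract the next component from a device tree path.
 *
 * Skips any separators, then returns a pointer to the component and
 * stores its length in *len.  Returns NULL when only separators (or
 * nothing) remain, so that "/" and "/soc/" never produce a lookup
 * for a child with the empty name "".
 */
static const char * path_component ( const char **path, size_t *len ) {
	const char *start = *path;

	assert ( len != NULL );

	/* Skip leading (or trailing) separators */
	while ( *start == '/' )
		start++;
	if ( ! *start )
		return NULL;

	/* Component runs up to the next separator or end of string */
	*len = strcspn ( start, "/" );
	*path = ( start + *len );
	return start;
}

/* Count the components in a path: zero identifies the root node */
static unsigned int path_depth ( const char *path ) {
	unsigned int depth = 0;
	size_t len;

	while ( path_component ( &path, &len ) )
		depth++;
	return depth;
}
```

With this approach the root path "/" simply yields no components, so a lookup loop driven by path_component() would return the root node without descending.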
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Version 3.0.0 of python-asn1 has a serious defect that causes it to
generate invalid DER.
Fix by switching to the asn1crypto module, which also allows for
simpler code to be used.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When creating a device tree to pass to a booted operating system,
ensure that the "chosen" node exists, and populate the "bootargs"
property with the image command line.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The allocation of memory for the certificate chain link may cause the
certificate itself to be freed by the cache discarder, if the only
current reference to the certificate is held by the certificate store
and the system runs out of memory during the call to malloc().
Ensure that this cannot happen by taking out a temporary additional
reference to the certificate within x509_append(), rather than
requiring the caller to do so.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Large transmitted records may arise if we have long client certificate
chains or if a client sends a large block of data (such as a large
HTTP POST payload). Fragment records as needed to comply with the
value that we advertise via the max_fragment_length extension.
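The fragmentation loop can be sketched roughly as follows. This is an illustration rather than the iPXE TLS code: send_fragmented(), send_record_t, and the logging sender are hypothetical, and the fragment limit is passed in as a plain parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that transmits a single record */
typedef int ( * send_record_t ) ( const uint8_t *data, size_t len,
				  void *ctx );

/* Send "len" bytes as records of at most "max_frag" bytes each */
static int send_fragmented ( const uint8_t *data, size_t len,
			     size_t max_frag, send_record_t send,
			     void *ctx ) {
	size_t frag_len;
	int rc;

	assert ( max_frag > 0 );

	while ( len ) {
		/* Limit this record to the advertised fragment length */
		frag_len = ( ( len < max_frag ) ? len : max_frag );
		if ( ( rc = send ( data, frag_len, ctx ) ) != 0 )
			return rc;
		data += frag_len;
		len -= frag_len;
	}
	return 0;
}

/* Example sender that only records what it was asked to transmit */
struct frag_log {
	size_t count;		/* Number of records sent */
	size_t total;		/* Total payload bytes sent */
	size_t last_len;	/* Length of the final record */
};

static int log_record ( const uint8_t *data, size_t len, void *ctx ) {
	struct frag_log *log = ctx;

	( void ) data;
	log->count++;
	log->total += len;
	log->last_len = len;
	return 0;
}
```

For example, a 10000-byte payload with a 4096-byte limit would be passed to the sender as records of 4096, 4096, and 1808 bytes.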
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RFC5246 states that "a client MAY send no certificates if it does not
have an appropriate certificate to send in response to the server's
authentication request". This use case may arise when the server is
using optional client certificate verification and iPXE has not been
provided with a client certificate to use.
Treat the absence of a suitable client certificate as a non-fatal
condition and send a Certificate message containing no certificates as
permitted by RFC5246.
Reported-by: Alexandre Ravey <alexandre@voilab.ch>
Originally-implemented-by: Alexandre Ravey <alexandre@voilab.ch>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Without any explicit alignment requirement, we will currently align
I/O buffers on their own size rounded up to the nearest power of two.
This is done to simplify driver transmit code paths, which can assume
that a standard Ethernet frame lies within a single physical page and
therefore does not need to be split even for devices with DMA engines
that cannot cross page boundaries.
Limit this automatic alignment to a maximum of the page size, to avoid
requiring excessive alignment for unusually large buffers (such as a
buffer allocated for an HTTP POST with a large parameter list).
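The capped alignment rule can be illustrated with the sketch below. It is not the iPXE allocator: iob_auto_align() is a hypothetical name, the loop is a deliberately simple way of rounding up to a power of two, and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch */
#define SKETCH_PAGE_SIZE 4096

/* Choose an automatic alignment for a buffer of the given length:
 * the length rounded up to a power of two, capped at the page size.
 */
static size_t iob_auto_align ( size_t len ) {
	size_t align = 1;

	while ( ( align < len ) && ( align < SKETCH_PAGE_SIZE ) )
		align <<= 1;
	assert ( align <= SKETCH_PAGE_SIZE );
	return align;
}
```

Under these assumptions a 1514-byte Ethernet frame is still aligned to 2048 bytes and so cannot straddle a page boundary, while a 64kB buffer is aligned to only 4096 bytes rather than to its own size.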
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide a custom xfer_alloc_iob() handler to ensure that transmit I/O
buffers contain sufficient headroom for the TLS record header and
record initialisation vector, and sufficient tailroom for the MAC,
block cipher padding, and authentication tag. This allows us to use
in-place encryption for the actual data within the I/O buffer, which
essentially halves the amount of memory that needs to be allocated for
a TLS data transmission.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Datagram sockets such as UDP, ICMP, and Fibre Channel tend to provide
a custom xfer_alloc_iob() handler to ensure that transmit I/O buffers
contain sufficient headroom to accommodate any required protocol
headers.
Stream sockets such as TCP and TLS do not typically provide a custom
xfer_alloc_iob() handler at present. The default handler simply calls
alloc_iob(), and so stream socket consumers can therefore get away
with using alloc_iob() rather than xfer_alloc_iob().
Fix the HTTP and ONC RPC protocols to use xfer_alloc_iob() where
relevant, in order to operate correctly if the underlying stream
socket chooses to provide a custom xfer_alloc_iob() handler.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Legacy ISA device probing involves poking at various I/O addresses to
guess whether or not a particular device is present.
Actual legacy ISA cards are essentially nonexistent by now, but the
probed I/O addresses have a habit of being reused for various
OEM-specific functions. This can cause some very undesirable side
effects. For example, probing for the "ne2k_isa" driver on an HP
Elitebook 840 G10 will cause the system to lock up in a way that
requires two cold reboots to recover.
Enable ISA_PROBE_ONLY in config/isa.h by default. This limits ISA
probing to use only the addresses specified in ISA_PROBE_ADDRS, which
is empty by default, and so effectively disables ISA probing. The
vanishingly small number of users who require ISA probing can simply
adjust this configuration in config/local/isa.h.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The executed image may call DisconnectController() to remove our
network device. This will leave the net device unregistered but not
yet freed (since our installed PXE base code protocol retains a
reference to the net device).
Unregistration will cause the network upper-layer driver removal
functions to be called, which will free the SNP device structure.
When the image returns from StartImage(), the snpdev pointer may
therefore no longer be valid.
The SNP device structure is not reference counted, and so we cannot
simply take out a reference to ensure that it remains valid across the
call to StartImage(). However, the code path following the call to
StartImage() doesn't actually require the SNP device pointer, only the
EFI device handle.
Store the device handle in a local variable and ensure that snpdev is
invalidated before the call to StartImage() so that future code cannot
accidentally reintroduce this issue.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
UEFI does not provide a direct method to disconnect the existing
driver of a specific protocol from a handle. We currently use
DisconnectController() to remove all drivers from a handle that we
want to drive ourselves, and then rely on recursion in the call to
ConnectController() to reconnect any drivers that did not need to be
disconnected in the first place.
Experience shows that OEMs tend not to ever test the disconnection
code paths in their UEFI drivers, and it is common to find drivers
that refuse to disconnect, fail to close opened handles, fail to
function correctly after reconnection, or lock up the entire system.
Implement a more selective form of disconnection, in which we use
OpenProtocolInformation() to identify the driver associated with a
specific protocol, and then disconnect only that driver.
Perform disconnections in reverse order of attachment priority, since
this is the order likely to minimise the number of cascaded implicit
disconnections.
This allows our MNP driver to avoid performing any disconnections at
all, since it does not require exclusive access to the MNP protocol.
It also avoids performing unnecessary disconnections and reconnections
of unrelated drivers such as the "UEFI WiFi Connection Manager" that
attaches to wireless network interfaces in order to manage wireless
network associations.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Define an ordering for internal EFI drivers on the basis of how close
the driver is to the hardware, and attempt to start drivers in this
order.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
UEFI assumes in several places that an image installs only a single
driver binding protocol instance, and that this is installed on the
image handle itself. We therefore provide a single driver binding
protocol instance, which delegates to the various internal drivers
(for EFI_PCI_IO_PROTOCOL, EFI_USB_IO_PROTOCOL, etc) as appropriate.
The debug messages produced by our Supported() method can end up
slightly misleading, since they will report only the first internal
driver that claims support for a device. In the common case of the
all-drivers build, there may be multiple drivers that claim support
for the same handle: for example, the PCI, NII, SNP, and MNP drivers
are all likely to initially find the protocols that they need on the
same device handle.
Report all internal drivers that claim support for a device, to avoid
confusing debug messages.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Return success if asked to stop driving a device that we are not
currently driving. This avoids propagating spurious errors to an
external caller of DisconnectController().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If we have a device tree available (e.g. because the user has
explicitly downloaded a device tree using the "fdt" command), then
provide it to the booted operating system as an EFI configuration
table.
Since x86 does not typically use device trees, we create weak symbols
for efi_fdt_install() and efi_fdt_uninstall() to avoid dragging FDT
support into all x86 UEFI binaries.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide fdt_create() to create a device tree to be passed to a booted
operating system. The device tree will be created from the FDT image
(if present), falling back to the system device tree (if present).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
EFI configuration tables may be freed at any time, and there is no way
to be notified when the table becomes invalidated. Create a copy of
the system flattened device tree (if present), so that we do not risk
being left with an invalid pointer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for parsing device trees where an external factor (such as a
downloaded image length) determines the maximum length, which must be
validated against the length within the device tree header.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When running on a platform that uses FDT as its hardware description
mechanism, we are likely to have multiple device tree structures. At
a minimum, there will be the device tree passed to us from the
previous boot stage (e.g. OpenSBI), and the device tree that we
construct to be passed to the booted operating system.
Update the internal FDT API to include an FDT pointer in all function
parameter lists.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow a Flattened Device Tree blob (DTB) to be provided to a booted
operating system using a script such as:
  #!ipxe
  kernel /images/vmlinuz console=ttyAMA0
  initrd /images/initrd.img
  fdt /images/rk3566-radxa-zero-3e.dtb
  boot
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Define the concept of an "FDT" image, representing a Flattened Device
Tree blob that has been downloaded in order to be provided to a kernel
or other executable image. FDT images are represented using an image
tag (as with other special-purpose images such as the UEFI shim), and
are similarly marked as hidden so that they will not be included in a
generated magic initrd or show up in a virtual filesystem directory
listing.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
If the UNDI interrupt remains constantly asserted (e.g. because the
BIOS has enabled interrupts for an unrelated device sharing the same
IRQ, or because of bugs in the OEM UNDI driver), then we may get stuck
in an interrupt storm.
We cannot safely chain to the previous interrupt handler (which could
plausibly handle an unrelated device interrupt) since there is no
well-defined behaviour for previous interrupt handlers. We have
observed BIOSes to provide default interrupt handlers that variously
do nothing, send EOI, disable the IRQ, or crash the system.
Fix by disabling the UNDI interrupt whenever our handler is triggered,
and rearm it as needed when polling the network device. This ensures
that forward progress continues to be made even if something causes
the interrupt to be constantly asserted.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When using the undionly.kkpxe binary (which is never recommended), the
UNDI interrupt may still be enabled when iPXE starts up. If the PXE
base code interrupt handler is not well-behaved, this can result in
undefined behaviour when interrupts are first enabled (e.g. for
entropy gathering, or for allowing the timer tick to occur).
Fix by detecting and disabling the UNDI interrupt during the prefix
code.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The implementation of PXENV_UNDI_GET_NIC_TYPE in some PXE ROMs
(observed with an Intel X710 ROM in a Dell PowerEdge R6515) will fail
to write the NicType byte, leaving it uninitialised.
Prepopulate the NicType byte with a highly unlikely value as a
sentinel to allow us to detect this, and assume that any such devices
are overwhelmingly likely to be PCI devices.
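The sentinel technique can be sketched as follows. Everything here is illustrative: the structure, the two simulated ROM behaviours, the sentinel value 0xa5, and the numeric NIC type constants are assumptions for the sketch rather than definitions taken from the iPXE source.

```c
#include <assert.h>
#include <stdint.h>

/* NIC type values (assumed for illustration) */
#define NIC_TYPE_PCI 2
#define NIC_TYPE_PNP 3

/* Hypothetical sentinel: a value no real ROM is expected to return */
#define NIC_TYPE_SENTINEL 0xa5

/* Simplified stand-in for the PXENV_UNDI_GET_NIC_TYPE parameter block */
struct nic_type_reply {
	uint8_t nic_type;
};

/* A simulated PXE ROM call */
typedef void ( * get_nic_type_t ) ( struct nic_type_reply *reply );

/* Well-behaved ROM: fills in the NIC type */
static void rom_good ( struct nic_type_reply *reply ) {
	reply->nic_type = NIC_TYPE_PNP;
}

/* Buggy ROM: returns without writing the NIC type */
static void rom_buggy ( struct nic_type_reply *reply ) {
	( void ) reply;
}

/* Query the NIC type, treating an untouched sentinel as "PCI" */
static uint8_t query_nic_type ( get_nic_type_t call ) {
	struct nic_type_reply reply;

	/* Prepopulate so that a missing write is detectable */
	reply.nic_type = NIC_TYPE_SENTINEL;
	call ( &reply );
	if ( reply.nic_type == NIC_TYPE_SENTINEL )
		return NIC_TYPE_PCI;
	return reply.nic_type;
}
```

Without the prepopulated value, the buggy case would read whatever happened to be on the stack, so the detected bus type could vary from boot to boot.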
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Provide wrapper macros to allow efi_open() and related functions to
accept a pointer to any pointer type as the "interface" argument, in
order to allow a substantial amount of type adjustment boilerplate to
be removed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
It is now simpler to use efi_open() than to use HandleProtocol() to
obtain an ephemeral protocol instance. Remove all remaining uses of
HandleProtocol() to simplify the code.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The UEFI model for opening and closing protocols is broken by design
and cannot be repaired.
Calling OpenProtocol() to obtain a protocol interface pointer does
not, in general, provide any guarantees about the lifetime of that
pointer. It is theoretically possible that the pointer has already
become invalid by the time that OpenProtocol() returns the pointer to
its caller. (This can happen when a USB device is physically removed,
for example.)
Various UEFI design flaws make it occasionally necessary to hold on to
a protocol interface pointer despite the total lack of guarantees that
the pointer will remain valid.
The UEFI driver model overloads the semantics of OpenProtocol() to
accommodate the use cases of recording a driver attachment (which is
modelled as opening a protocol with EFI_OPEN_PROTOCOL_BY_DRIVER
attributes) and recording the existence of a related child controller
(which is modelled as opening a protocol with
EFI_OPEN_PROTOCOL_BY_CHILD_CONTROLLER attributes).
The parameters defined for CloseProtocol() are not sufficient to allow
the implementation to precisely identify the matching call to
OpenProtocol(). While the UEFI model appears to allow for matched
open and close pairs, this is merely an illusion. Calling
CloseProtocol() will delete *all* matching records in the protocol
open information tables.
Since the parameters defined for CloseProtocol() do not include the
attributes passed to OpenProtocol(), this means that a matched
open/close pair using EFI_OPEN_PROTOCOL_GET_PROTOCOL can inadvertently
end up deleting the record that defines a driver attachment or the
existence of a child controller. This in turn can cause some very
unexpected side effects, such as allowing other UEFI drivers to start
controlling hardware to which iPXE believes it has exclusive access.
This rarely ends well.
To prevent this kind of inadvertent deletion, we establish a
convention for four different types of protocol opening:
- ephemeral opens: always opened with ControllerHandle = NULL
- unsafe opens: always opened with ControllerHandle = AgentHandle
- by-driver opens: always opened with ControllerHandle = Handle
- by-child opens: always opened with ControllerHandle != Handle
This convention ensures that the four types of open never overlap
within the set of parameters defined for CloseProtocol(), and so a
close of one type cannot inadvertently delete the record corresponding
to a different type.
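The effect of the convention can be demonstrated with a toy model of the protocol open information table. This is a deliberately simplified sketch and not UEFI or iPXE code: it tracks a single protocol, the type and function names are hypothetical, and the only behaviour modelled is that a close matches on (handle, agent, controller) and ignores the attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPENS 8

/* Simplified open attributes (illustrative names, not the UEFI ones) */
enum open_attr { OPEN_GET_PROTOCOL, OPEN_BY_DRIVER, OPEN_BY_CHILD };

/* One record in the toy open information table */
struct open_record {
	int used;
	const void *handle;
	const void *agent;
	const void *controller;
	enum open_attr attr;
};

struct open_table {
	struct open_record rec[MAX_OPENS];
};

static void table_init ( struct open_table *table ) {
	memset ( table, 0, sizeof ( *table ) );
}

/* Record an open; returns 0 on success or -1 if the table is full */
static int table_open ( struct open_table *table, const void *handle,
			const void *agent, const void *controller,
			enum open_attr attr ) {
	struct open_record *rec;
	size_t i;

	for ( i = 0 ; i < MAX_OPENS ; i++ ) {
		rec = &table->rec[i];
		if ( rec->used )
			continue;
		rec->used = 1;
		rec->handle = handle;
		rec->agent = agent;
		rec->controller = controller;
		rec->attr = attr;
		return 0;
	}
	return -1;
}

/* Close: there is no "attr" parameter, so every record matching
 * (handle, agent, controller) is deleted.  Returns the number deleted.
 */
static unsigned int table_close ( struct open_table *table,
				  const void *handle, const void *agent,
				  const void *controller ) {
	struct open_record *rec;
	unsigned int deleted = 0;
	size_t i;

	for ( i = 0 ; i < MAX_OPENS ; i++ ) {
		rec = &table->rec[i];
		if ( rec->used && ( rec->handle == handle ) &&
		     ( rec->agent == agent ) &&
		     ( rec->controller == controller ) ) {
			rec->used = 0;
			deleted++;
		}
	}
	return deleted;
}

/* Count the surviving records with a given attribute */
static unsigned int table_count ( const struct open_table *table,
				  enum open_attr attr ) {
	unsigned int count = 0;
	size_t i;

	for ( i = 0 ; i < MAX_OPENS ; i++ ) {
		if ( table->rec[i].used && ( table->rec[i].attr == attr ) )
			count++;
	}
	return count;
}
```

In this model, an ephemeral open that reuses the by-driver controller handle causes the matching close to delete both records. An ephemeral open that uses a NULL controller handle is closed without touching the by-driver record.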
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In preparation for formalising the way that EFI protocols are opened
across the codebase, remove the efipci_open() wrapper.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently have both efipci_info() and efi_pci_info() serving
different but related purposes. Rename the latter to reduce
confusion.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit e727f57 ("[efi] Include a copy of the device path within struct
efi_device") neglected to delete the closure of the parent's device
path from the success code path in efi_snp_probe().
Reduce confusion by removing this (harmless) additional close.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some non-driver handles may have an installed component name protocol.
In particular, iPXE itself installs these protocols on its SNP device
handles, to simplify the process of delegating GetControllerName()
from our single-instance driver binding protocol to whatever child
controllers the relevant EFI driver may have installed.
For non-driver handles, the device path is more useful as debugging
information than the driver name. Limit the use of the component name
protocols to handles with a driver binding protocol installed, so that
we will end up using the device path for non-driver handles such as
the SNP device.
Continue to prefer the driver name to the device path for handles with
a driver binding protocol installed, since these will generally map to
things we are likely to conceptualise as drivers rather than as
devices.
Note that we deliberately do not use GetControllerName() to attempt to
get a human-readable name for a controller handle. In the normal
course of events, iPXE is likely to disconnect at least some existing
drivers from their controller handles. This would cause the name
obtained via GetControllerName() to change. By using the device path
instead, we ensure that the debug message name remains the same even
when the driver controlling the handle is changed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Attempt to get the veto candidate driver name from both the current
and obsolete versions of the component name protocol.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for drivers that do not install the driver binding protocol on
the image handle by opening the component name protocol on the driver
binding's ImageHandle rather than on the driver handle itself.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When hunting down a misbehaving OEM driver to add it to the veto list,
it can be very useful to know the address ranges used by each driver.
Add this information to the verbose debug messages.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The driver name is usually more informative for debug messages than
the device path from which a driver was loaded. Try using the various
mechanisms for obtaining a driver name before trying the device path.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Not all drivers will install the driver binding protocol on the image
handle. Accommodate these drivers by attempting to retrieve the
driver name via the component name protocol(s) located on the driver
binding's ImageHandle, as well as on the driver handle itself.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When DEBUG=efi_wrap is enabled, we construct a patched copy of the
boot services table and patch the global system table to point to this
copy. This ensures that any subsequently loaded EFI binaries will
call our wrappers.
Previously loaded EFI binaries will typically have cached the boot
services table pointer (in the gBS variable used by EDK2 code), and
therefore will not pick up the updated pointer and so will not call
our wrappers. In most cases, this is what we want to happen: we are
interested in tracing the calls issued by the newly loaded binary and
we do not want to be distracted by the high volume of boot services
calls issued by existing UEFI drivers.
In some circumstances (such as when a badly behaved OEM driver is
causing the system to lock up during the ExitBootServices() call), it
can be very useful to be able to patch the global boot services table
in situ, so that we can trace calls issued by existing drivers.
Restructure the wrapping code to allow wrapping to be enabled or
disabled at any time, and to allow for patching the global boot
services table in situ.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The debug wrappers for CloseEvent() and CheckEvent() are currently
both calling SignalEvent() instead (presumably due to copy-paste
errors). Astonishingly, this has generally not prevented a successful
boot in the (very rare) case that DEBUG=efi_wrap is enabled.
Fix the wrappers to call the intended functions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The virtual filesystem that we provide to expose downloaded images
will erroneously interpret filenames with redundant path separators
such as ".\filename" as an attempt to open the directory, rather than
an attempt to open "filename".
This shows up most obviously when chainloading from one iPXE into
another iPXE, when the inner iPXE may end up attempting to open
".\autoexec.ipxe" from the outer iPXE's virtual filesystem. (The
erroneously opened file will have a zero length and will therefore be
ignored, but is still confusing.)
Fix by discarding any dot or backslash characters after a potential
initial backslash. This is very liberal and will accept some
syntactically invalid paths, but this is acceptable since our virtual
filesystem does not implement directories anyway.
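The liberal prefix handling can be sketched as below. This is an illustration only: the real code operates on UEFI wide-character filenames, whereas this sketch uses plain char strings and a case-sensitive comparison, and the vfs_*() names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Skip redundant leading '.' and '\' characters in a requested path */
static const char * vfs_strip ( const char *path ) {
	assert ( path != NULL );
	while ( ( *path == '.' ) || ( *path == '\\' ) )
		path++;
	return path;
}

/* An empty remainder refers to the (only) directory itself */
static int vfs_is_root ( const char *path ) {
	return ( *vfs_strip ( path ) == '\0' );
}

/* Check whether a requested path refers to the named image */
static int vfs_matches ( const char *path, const char *image_name ) {
	return ( strcmp ( vfs_strip ( path ), image_name ) == 0 );
}
```

As noted above, this accepts some syntactically invalid paths: in this sketch "..\.hidden" would match an image named "hidden". That is tolerable only because there are no real directories to confuse.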
Signed-off-by: Michael Brown <mcb30@ipxe.org>
On some systems (observed with an HP Elitebook 840 G10), writing
console output that happens to cause the display to scroll will modify
the system memory map. This causes builds with DEBUG=efi_wrap to
typically fail to boot, since the debug output from the wrapped
ExitBootServices() call itself is sufficient to change the memory map
and therefore cause ExitBootServices() to fail due to an invalid
memory map key.
Work around these UEFI firmware bugs by prescrolling the display after
a failed ExitBootServices() attempt, in order to minimise the chance
that further scrolling will happen during the subsequent attempt.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The python-asn1 documentation indicates that end of file may be
detected either by obtaining a True value from .eof() or by obtaining
a None value from .peek(), but does not mention any way to detect the
end of a constructed tag (rather than the end of the overall file).
We currently use .eof() to detect the end of a constructed tag, based
on the observed behaviour of the library.
The behaviour of .eof() changed between versions 2.8.0 and 3.0.0, such
that .eof() no longer returns True at the end of a constructed tag.
Switch to testing for a None value returned from .peek() to determine
when we have reached the end of a constructed tag, since this works on
both newer and older versions.
Continue to treat .eof() as a necessary but not sufficient condition
for reaching the overall end of file, to maintain compatibility with
older versions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Legacy IRQ 8 appears to be enabled by default on some platforms. If
iPXE selects the RTC entropy source, this will currently result in the
RTC IRQ 8 being unconditionally disabled. This can break assumptions
made by BIOSes or subsequent bootloaders: in particular, the FreeBSD
loader may lock up at the point of starting its default 10-second
countdown when it calls INT 15,86.
Fix by restoring the previous state of IRQ 8 instead of disabling it
unconditionally. Note that we do not need to disable IRQ 8 around the
point of hooking (or unhooking) the ISR, since this code will be
executing in iPXE's normal state of having interrupts disabled anyway.
Also restore the previous state of the RTC periodic interrupt enable,
rather than disabling it unconditionally.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Return the previous interrupt enabled state from enable_irq() and
disable_irq(), to allow callers to more easily restore this state.
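The save-and-restore pattern that this enables can be sketched against a simulated interrupt mask. This is not the iPXE code: the mask variable stands in for the interrupt controller, and the *_sim() function names are hypothetical.

```c
#include <assert.h>

/* Simulated interrupt mask: a set bit means the IRQ is disabled */
static unsigned int irq_mask = 0xffffU;

/* Report whether an IRQ is currently enabled */
static int irq_enabled_sim ( unsigned int irq ) {
	assert ( irq < 16 );
	return ( ! ( irq_mask & ( 1U << irq ) ) );
}

/* Enable an IRQ, returning the previous enabled state */
static int enable_irq_sim ( unsigned int irq ) {
	int was_enabled = irq_enabled_sim ( irq );

	irq_mask &= ~( 1U << irq );
	return was_enabled;
}

/* Disable an IRQ, returning the previous enabled state */
static int disable_irq_sim ( unsigned int irq ) {
	int was_enabled = irq_enabled_sim ( irq );

	irq_mask |= ( 1U << irq );
	return was_enabled;
}

/* Put an IRQ back into a previously recorded state */
static void restore_irq_sim ( unsigned int irq, int was_enabled ) {
	if ( was_enabled ) {
		enable_irq_sim ( irq );
	} else {
		disable_irq_sim ( irq );
	}
}
```

A caller that needs the IRQ temporarily would record the value returned by enable_irq_sim() and later pass it to restore_irq_sim(), so that an IRQ that the platform had already enabled is left enabled afterwards.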
Signed-off-by: Michael Brown <mcb30@ipxe.org>
UEFI's built-in HTTPS boot mechanism requires the trusted CA
certificates to be provided via the TlsCaCertificate variable.
(There is no equivalent of the iPXE cross-signing mechanism, so it is
not possible for UEFI to automatically use public CA certificates.)
Users who have configured UEFI HTTPS boot to use a custom root of
trust (e.g. a private CA certificate) may find it useful to have iPXE
automatically pick up and use this same root of trust, so that iPXE
can seamlessly fetch files via HTTPS from the same servers that were
trusted by UEFI HTTPS boot, in addition to servers that iPXE can
validate through other means such as cross-signed certificates.
Parse the TlsCaCertificate variable at startup, add any certificates
to the certificate store, and mark these certificates as trusted.
There are no access restrictions on modifying the TlsCaCertificate
variable: anybody with access to write UEFI variables is permitted to
change the root of trust. The UEFI security model assumes that anyone
with access to run code prior to ExitBootServices() or with access to
modify UEFI variables from within a loaded operating system is
supposed to be able to change the system's root of trust for TLS.
Any certificates parsed from TlsCaCertificates will show up in the
output of "certstat", and may be discarded using "certfree" if
unwanted.
Support for parsing TlsCaCertificates is enabled by default in EFI
builds, but may be disabled in config/general.h if needed.
As with the ${trust} setting, the contents of the TlsCaCertificates
variable will be ignored if iPXE has been compiled with an explicit
root of trust by specifying TRUST=... on the build command line.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the TlsAuthentication.h header from EDK2's NetworkPkg, along with
a GUID definition for EFI_TLS_CA_CERTIFICATE_GUID.
It is unclear whether or not the TlsCaCertificate variable is intended
to be a UEFI standard. Its presence in NetworkPkg (rather than
MdePkg) suggests not, but the choice of EFI_TLS_CA_CERTIFICATE_GUID
(rather than e.g. EDKII_TLS_CA_CERTIFICATE_GUID) suggests that it is
intended to be included in future versions of the standard.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for the possibility of creating empty directories (without
having to include a dummy file inside the directory) using a
zero-length image and a CPIO filename with a trailing slash, such as:
initrd emptyfile /usr/share/oem/
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Commit 12ea8c4 ("[cpio] Allow for construction of parent directories
as needed") introduced a regression in constructing CPIO archive
headers for relative paths (e.g. simple filenames with no leading
slash).
Fix by counting the number of path components rather than the number
of path separators, and add some test cases to cover CPIO header
construction.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for the EFI signature list image format (as produced by
tools such as efisecdb).
The parsing code does not require any EFI boot services functions and
so may be enabled even in non-EFI builds. We default to enabling it
only for EFI builds.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently provide pem_asn1() to allow for parsing of PEM data that
is not necessarily contained in an image. Provide an equivalent
function der_asn1() to allow for similar parsing of DER data.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The debug message transcription of well-known EFI GUIDs does not
require any EFI boot services calls. Move this code from efi_debug.c
to efi_guid.c, to allow it to be linked in to non-EFI builds.
We continue to rely on linker garbage collection to ensure that the
code is omitted completely from any non-debug builds.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The EFI host utilities (such as elf2efi64, efirom, etc) include the
EDK2 headers, which include static assertions to ensure that they are
built with -fshort-wchar enabled. When building the host utilities,
we currently bypass these assertions by defining MDE_CPU_EBC. The EBC
compiler apparently does not support static assertions, and defining
MDE_CPU_EBC therefore causes EDK2's Base.h to define STATIC_ASSERT()
as a no-op.
Newer versions of the EDK2 headers omit the check for MDE_CPU_EBC (and
will presumably therefore fail to build with the EBC compiler). This
causes our host utility builds to fail since the static assertion now
detects that we are building with the host's default ABI (i.e. without
enabling -fshort-wchar).
Fix by enabling -fshort-wchar when building EFI host utilities. This
produces binaries that are technically incompatible with the host ABI.
However, since our host utilities never handle any wide-character
strings, this nominal ABI incompatibility has no effect.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The UsbHostController.h header has been removed from the EDK2 codebase
since it was never defined in a released UEFI specification. However,
we may still encounter it in the wild and so it is useful to retain
the GUID and the corresponding protocol name for debug messages.
Add an iPXE include guard to this file so that the EDK2 header import
script will no longer attempt to import it from the EDK2 tree.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The bzImage specification allows two bytes for the setup code jump
instruction at offset 0x200, which limits its relative offset to +0x7f
bytes. This imposes an upper limit on the length of the version
string, which currently precedes the setup code.
Fix by moving the version string to the .prefix.data section, so that
it no longer affects the placement of the setup code.
Originally-fixed-by: Miao Wang <shankerwangmiao@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
iPXE allows individual raw files to be automatically wrapped with
suitable CPIO headers and injected into the magic initrd image as
exposed to a booted Linux kernel. This feature is currently limited
to placing files within directories that already exist in the initrd
filesystem.
Remove this limitation by adding the ability for iPXE to construct
CPIO headers for parent directories as needed, under control of the
"mkdir=<n>" command-line argument. For example:
initrd config.ign /usr/share/oem/config.ign mkdir=1
will create CPIO headers for the "/usr/share/oem" directory as well as
for the "/usr/share/oem/config.ign" file itself.
This simplifies the process of booting operating systems such as
Flatcar Linux, which otherwise require the single "config.ign" file to
be manually wrapped up as a CPIO archive solely in order to create the
relevant parent directory entries.
The value <n> may be used to control the number of parent directory
entries that are created. For example, "mkdir=2" would cause up to
two parent directories to be created (i.e. "/usr/share" and
"/usr/share/oem" in the above example). A negative value such as
"mkdir=-1" may be used to create all parent directories up to the root
of the tree.
Do not create any parent directory entries by default, since doing so
would potentially cause the modes and ownership information for
existing directories to be overwritten.
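As a rough illustration of the "mkdir=<n>" semantics described above,
the following sketch lists the parent directories that would receive
their own CPIO headers. The function name parent_dirs() is
hypothetical, and the sketch assumes an absolute path; it is not the
iPXE implementation.

```python
def parent_dirs(path, mkdir):
    """Return the parent directories that would get their own CPIO headers.

    Hypothetical helper: 'mkdir' follows the mkdir=<n> semantics described
    in the text (0 = none, n = up to n nearest parents, negative = all).
    """
    parts = [p for p in path.split("/") if p]
    # Every proper ancestor of the file, nearest to the root first
    parents = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    if mkdir < 0:
        return parents
    if mkdir == 0:
        return []
    # Keep only the 'mkdir' directories closest to the file itself
    return parents[-mkdir:]
```

With the path from the example above, parent_dirs(path, 1) yields only
"/usr/share/oem", while a negative value yields every ancestor below
the root.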
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow the "--retimeout" option to be used to specify a timeout value
that will be (re)applied after each keypress activity. This allows
script authors to ensure that a single (potentially accidental)
keypress will not pause the boot process indefinitely.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The ANS X9.82 specification implicitly assumes that the RBG_Startup
function will be called before it is needed, and includes checks to
make sure that Generate_function fails if this has not happened.
However, there is no well-defined point at which the RBG_Startup
function is to be called: it's just assumed that this happens as part
of system startup.
We currently call RBG_Startup to instantiate the DRBG as an iPXE
startup function, with the corresponding shutdown function
uninstantiating the DRBG. This works for most use cases, and avoids
an otherwise unexpected user-visible delay when a caller first
attempts to use the DRBG (e.g. by attempting an HTTPS download).
The download of autoexec.ipxe for UEFI is triggered by the EFI root
bus probe in efi_probe(). Both the root bus probe and the RBG startup
function run at STARTUP_NORMAL, so there is no defined ordering
between them. If the base URI for autoexec.ipxe uses HTTPS, then this
may cause random bits to be requested before the RBG has been started.
Extend the logic in rbg_generate() to automatically start up the RBG
if startup has not already been attempted. If startup fails
(e.g. because the entropy source is broken), then do not automatically
retry since this could result in extremely long delays waiting for
entropy that will never arrive.
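The start-on-first-use behaviour can be sketched as below. This is a
toy model only: the class, its attributes, and the use of os.urandom()
as a stand-in generator are all assumptions made for illustration, and
do not reflect the actual rbg_generate() code.

```python
import os

class Rbg:
    """Toy model of an RBG that starts itself on first use (illustrative)."""

    def __init__(self, entropy_ok=True):
        self.entropy_ok = entropy_ok   # stand-in for a working entropy source
        self.attempted = False         # has startup ever been attempted?
        self.started = False           # did startup succeed?
        self.startups = 0              # number of startup attempts made

    def startup(self):
        self.attempted = True
        self.startups += 1
        self.started = self.entropy_ok

    def generate(self, length):
        # Start automatically, but only if startup was never attempted
        if not self.attempted:
            self.startup()
        # A failed startup is never retried automatically
        if not self.started:
            raise RuntimeError("RBG not started")
        return os.urandom(length)
```

Recording "attempted" separately from "started" is what prevents a
broken entropy source from being retried on every request.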
Reported-by: Michael Niehaus <niehaus@live.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In almost all cases, the download timeout for autoexec.ipxe is
irrelevant: the operation will either succeed or fail relatively
quickly (e.g. due to a nonexistent file). The overall download
timeout exists only to ensure that an unattended or headless system
will not wait indefinitely in the case of a degenerate network
response (e.g. an HTTP server that returns an endless trickle of data
using chunked transfer encoding without ever reaching the end of the
file).
The current download timeout is too short if PeerDist content encoding
is enabled, since the overall download will abort before the first
peer discovery attempt has completed, and without allowing sufficient
time for an origin server range request.
The single timeout value is currently used for both the download
timeout and the sync timeout. The latter timeout exists only to allow
network communication to be gracefully quiesced before removing the
temporary MNP network device, and may safely be shortened without
affecting functionality.
Fix by increasing the download timeout from two seconds to 30 seconds,
and defining a separate one-second timeout for the sync operation.
Reported-by: Michael Niehaus <niehaus@live.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The only remaining use case for direct reduction (outside of the unit
tests) is in calculating the constant R^2 mod N used during Montgomery
multiplication.
The current implementation of direct reduction requires a writable
copy of the modulus (to allow for shifting), and both the modulus and
the result buffer must be padded to be large enough to hold (R^2 - N),
which is twice the size of the actual values involved.
For the special case of reducing R^2 mod N (or any power of two mod
N), we can run the same algorithm without needing either a writable
copy of the modulus or a padded result buffer. The working state
required is only two bits larger than the result buffer, and these
additional bits may be held in local variables instead.
Rewrite bigint_reduce() to handle only this use case, and remove the
no longer necessary uses of double-sized big integers.
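The property being exploited can be illustrated with a deliberately
simplified sketch. This is not the algorithm used by bigint_reduce();
it merely shows that a power of two can be reduced modulo N using a
read-only modulus and a working value that stays within a bit or so of
the size of N. The function name is hypothetical.

```python
def power_of_two_mod(k, n):
    """Compute 2^k mod n by repeated doubling, never modifying n.

    Illustrative only: the working value r never exceeds 2n, so it needs
    at most one bit more than n itself.
    """
    r = 1 % n
    for _ in range(k):
        r <<= 1          # double the running value
        if r >= n:       # a single subtraction restores r < n
            r -= n
    return r
```

For a modulus occupying "size" elements of "width" bits each, the
Montgomery constant R^2 mod N would correspond to
power_of_two_mod(2 * size * width, n).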
Signed-off-by: Michael Brown <mcb30@ipxe.org>
When allocating memory with a non-zero alignment offset, the free
memory block structure following the allocation may end up improperly
aligned.
Ensure that free memory blocks always remain aligned to the size of
the free memory block structure.
Ensure that the initial heap is also correctly aligned, thereby
allowing the logic for leaking undersized free memory blocks to be
omitted.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The NIST elliptic curves are Weierstrass curves and have the form
y^2 = x^3 + ax + b
with each curve defined by its field prime, the constants "a" and "b",
and a generator base point.
Implement a constant-time algorithm for point addition, based upon
Algorithm 1 from "Complete addition formulas for prime order elliptic
curves" (Joost Renes, Craig Costello, and Lejla Batina), and use this
as a Montgomery ladder commutative operation to perform constant-time
point multiplication.
The code for point addition is implemented using a custom bytecode
interpreter with 16-bit instructions, since this results in
substantially smaller code than compiling the somewhat lengthy
sequence of arithmetic operations directly. Values are calculated
modulo small multiples of the field prime in order to allow for the
use of relaxed Montgomery reduction.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Montgomery ladder may be used to perform any operation that is
isomorphic to exponentiation, i.e. to compute the result
r = g^e = g * g * g * g * .... * g
for an arbitrary commutative operation "*", base or generator "g", and
exponent "e".
Implement a generic Montgomery ladder for use by both modular
exponentiation and elliptic curve point multiplication (both of which
are isomorphic to exponentiation).
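A minimal sketch of such a generic ladder is shown below. The function
name and parameters are hypothetical, and the data-dependent branch is
used only for readability: a real constant-time implementation would
replace it with a constant-time conditional swap.

```python
def ladder(op, identity, g, e, bits):
    """Compute g "to the power" e under the operation op.

    Scans a fixed number of exponent bits, most significant first, and
    performs exactly two operations per bit regardless of the bit value.
    Invariant: r1 == op(r0, g) at the start of every iteration.
    """
    r0, r1 = identity, g
    for i in reversed(range(bits)):
        if (e >> i) & 1:
            r0, r1 = op(r0, r1), op(r1, r1)
        else:
            r0, r1 = op(r0, r0), op(r0, r1)
    return r0
```

The same function covers modular exponentiation (op is multiplication
modulo N, identity is 1) and scalar multiplication in an additive
group (op is addition, identity is zero).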
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The elliptic curve point representation for the x25519 curve includes
only the X value, since the curve is designed such that the Montgomery
ladder does not need to ever know or calculate a Y value. There is no
curve point format byte: the public key data is simply the X value.
The pre-master secret is also simply the X value of the shared secret
curve point.
The point representation for the NIST curves includes both X and Y
values, and a single curve point format byte that must indicate that
the format is uncompressed. The pre-master secret for the NIST curves
does not include both X and Y values: only the X value is used.
Extend the definition of an elliptic curve to allow the point size to
be specified separately from the key size, and extend the definition
of a TLS named curve to include an optional curve point format byte
and a pre-master secret length.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Split out the portion of tls_send_client_key_exchange_ecdhe() that
actually performs the elliptic curve key exchange into a separate
function ecdhe_key().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
In debug messages, big integers are currently printed as hex dumps.
This is quite verbose and cumbersome to check against external
sources.
Add bigint_ntoa() to transcribe big integers into a static buffer
(following the model of inet_ntoa(), eth_ntoa(), uuid_ntoa(), etc).
Abbreviate big integers that will not fit within the static buffer,
showing both the most significant and least significant digits in the
transcription. This is generally the most useful form when visually
comparing against external sources (such as test vectors, or results
produced by high-level languages).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Calculating the Montgomery constant (R^2 mod N) is done in our
implementation by zeroing the double-width representation of N,
subtracting N once to give (R^2 - N) in order to obtain a positive
value, then reducing this value modulo N.
Extract this logic from bigint_mod_exp() to a separate function
bigint_reduce_supremum(), to allow for reuse by other code.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Classic Montgomery reduction involves a single conditional subtraction
to ensure that the result is strictly less than the modulus.
When performing chains of Montgomery multiplications (potentially
interspersed with additions and subtractions), it can be useful to
work with values that are stored modulo some small multiple of the
modulus, thereby allowing some reductions to be elided. Each addition
and subtraction stage will increase this running multiple, and the
following multiplication stages can be used to reduce the running
multiple since the reduction carried out for multiplication products
is generally strong enough to absorb some additional bits in the
inputs. This approach is already used in the x25519 code, where
multiplication takes two 258-bit inputs and produces a 257-bit output.
Split out the conditional subtraction from bigint_montgomery() and
provide a separate bigint_montgomery_relaxed() for callers who do not
require immediate reduction to within the range of the modulus.
Modular exponentiation could potentially make use of relaxed
Montgomery multiplication, but this would require R>4N, i.e. that the
two most significant bits of the modulus be zero. For both RSA and
DHE, this would necessitate extending the modulus size by one element,
which would negate any speed increase from omitting the conditional
subtractions. We therefore retain the use of classic Montgomery
reduction for modular exponentiation, apart from the final conversion
out of Montgomery form.
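The relationship between the two forms, and the reason for the R>4N
requirement, can be sketched as follows. This operates on whole
Python integers rather than element arrays, and the function names are
illustrative rather than the actual bigint API.

```python
def redc_relaxed(t, n, r_bits):
    """Montgomery reduction without the final conditional subtraction.

    Returns a value congruent to t/R mod n, where R = 2^r_bits.  If the
    inputs to the preceding multiplication were below 2n and R > 4n,
    then t < 4n^2 and the result is below 2n.
    """
    r = 1 << r_bits
    n_inv = pow(n, -1, r)              # N^-1 mod R (requires Python 3.8+)
    m = (-t * n_inv) % r               # chosen so that t + m*n is divisible by R
    return (t + m * n) >> r_bits       # exact division; not fully reduced

def redc(t, n, r_bits):
    """Classic form: relaxed reduction plus one conditional subtraction."""
    x = redc_relaxed(t, n, r_bits)
    return x - n if x >= n else x
```

Since m < R, the relaxed result is below (t + R*n)/R, which is
n*(4n/R + 1) for t < 4n^2: this stays below 2n exactly when R > 4n.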
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Reduce the number of parameters passed to bigint_montgomery() by
calculating the inverse of the modulus modulo the element size on
demand. Cache the result, since Montgomery reduction will be used
repeatedly with the same modulus value.
In all currently supported algorithms, the modulus is a public value
(or a fixed value defined by specification) and so this non-constant
timing does not leak any private information.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The startup process is scheduled to run when the device is opened and
terminated (if still running) when the device is closed. It assumes
that the resource allocation performed in gve_open() has taken place,
and that the admin and transmit/receive data structure pointers are
therefore valid.
The process initialisation in gve_probe() erroneously calls
process_init() rather than process_init_stopped() and will therefore
schedule the startup process immediately, before the relevant
resources have been allocated.
This bug is masked in the typical use case of a Google Cloud instance
with a single NIC built with the config/cloud/gce.ipxe embedded
script, since the embedded script will immediately open the NIC (and
therefore allocate the required resources) before the scheduled
process is allowed to run for the first time. In a multi-NIC
instance, undefined behaviour will arise as soon as the startup
process for the second NIC is allowed to run.
Fix by using process_init_stopped() to avoid implicitly scheduling the
startup process during gve_probe().
Originally-fixed-by: Kal Cutter Conley <kalcutterc@nvidia.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is no further need for a standalone modular multiplication
primitive, since the only consumer is modular exponentiation (which
now uses Montgomery multiplication instead).
Remove the now obsolete bigint_mod_multiply().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Speed up modular exponentiation by using Montgomery reduction rather
than direct modular reduction.
Montgomery reduction in base 2^n requires the modulus to be coprime to
2^n, which would limit us to requiring that the modulus is an odd
number. Extend the implementation to include support for
exponentiation with even moduli via Garner's algorithm as described in
"Montgomery reduction with even modulus" (Koç, 1994).
Since almost all use cases for modular exponentiation require a large
prime (and hence odd) modulus, the support for even moduli could
potentially be removed in future.
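The even-modulus approach can be sketched at a high level as below:
split the modulus into an odd part and a power of two, exponentiate
modulo each part, then recombine. The function name is hypothetical,
and Python's built-in pow() stands in for both the Montgomery-based
exponentiation and the modular inverse (Python 3.8+).

```python
def mod_exp_even(base, exponent, modulus):
    """Modular exponentiation that tolerates an even modulus (sketch)."""
    # Split modulus = q * 2^j with q odd
    j = (modulus & -modulus).bit_length() - 1
    if j == 0:
        return pow(base, exponent, modulus)   # already odd
    q = modulus >> j
    two_j = 1 << j
    x1 = pow(base, exponent, q)        # odd part: suitable for Montgomery
    x2 = pow(base, exponent, two_j)    # power-of-two part: reduction is a mask
    q_inv = pow(q, -1, two_j)          # q is odd, so this inverse exists
    # Recombine: the result is x1 mod q and x2 mod 2^j, and is below q * 2^j
    return x1 + q * (((x2 - x1) * q_inv) % two_j)
```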
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Montgomery reduction is substantially faster than direct reduction,
and is better suited for modular exponentiation operations.
Add bigint_montgomery() to perform the Montgomery reduction operation
(often referred to as "REDC"), along with some test vectors.
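For reference, an element-by-element REDC can be sketched as follows.
This is an illustration on Python integers with an assumed 32-bit
element width and hypothetical names; it is not the bigint_montgomery()
implementation.

```python
def montgomery_redc(t, n, elements, width=32):
    """Return t / R mod n, where R = 2^(width * elements) and n is odd.

    Requires t < n * R.  Each pass clears the least significant element
    of t by adding a suitable multiple of n, then shifts it out.
    """
    mask = (1 << width) - 1
    # -1/N modulo a single element (requires Python 3.8+ for pow(n, -1, m))
    n_neg_inv = (-pow(n, -1, 1 << width)) & mask
    for _ in range(elements):
        m = ((t & mask) * n_neg_inv) & mask   # makes low element of t + m*n zero
        t = (t + m * n) >> width              # exact division by 2^width
    # After the loop t < 2n, so one conditional subtraction suffices
    return t - n if t >= n else t
```

Only the least significant element of the inverse is ever needed,
which is why the inverse is taken modulo a single element.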
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Montgomery reduction requires only the least significant element of an
inverse modulo 2^k, which in turn depends upon only the least
significant element of the invertend.
Use the inverse size (rather than the invertend size) as the effective
size for bigint_mod_invert(). This eliminates around 97% of the loop
iterations for a typical 2048-bit RSA modulus.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
With a slight modification to the algorithm to ignore bits of the
residue that can never contribute to the result, it is possible to
reuse the as-yet uncalculated portions of the inverse to hold the
residue. This removes the requirement for additional temporary
working space.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Direct modular reduction is expected to be used in situations where
there is no requirement to retain the original (unreduced) value.
Modify the API for bigint_reduce() to reduce the value in place
(removing the separate result buffer), impose a constraint that the
modulus and value have the same size, and require the modulus to be
passed in writable memory (to allow for scaling in place). This
removes the requirement for additional temporary working space.
Reverse the order of arguments so that the constant input is first,
to match the usage pattern for bigint_add() et al.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Expose the effective carry (or borrow) out flag from big integer
addition and subtraction, and use this to elide an explicit bit test
when performing x25519 reduction.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a dedicated bigint_msb_is_set() to reduce the amount of open
coding required in the common case of testing the sign of a two's
complement big integer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
UEFI systems may choose not to connect drivers for local disk drives
when the boot policy is set to attempt a network boot. This may cause
the "sanboot" command to be unable to boot from a local drive, since
the relevant block device and filesystem drivers may not have been
connected.
Fix by ensuring that all available drivers are connected before
attempting to boot from an EFI block device.
Reported-by: Andrew Cottrell <andrew.cottrell@xtxmarkets.com>
Tested-by: Andrew Cottrell <andrew.cottrell@xtxmarkets.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently require the variable CROSS (or CROSS_COMPILE) to be set
to specify the global cross-compilation prefix. This becomes
cumbersome when developing across multiple CPU architectures,
requiring frequent editing of build command lines and preventing
incompatible architectures from being built with a single command.
Allow a default cross-compilation prefix for each architecture to be
specified via the CROSS_COMPILE_<arch> variables. These may then be
provided as environment variables, e.g. using
export CROSS_COMPILE_arm32=arm-linux-gnu-
export CROSS_COMPILE_arm64=aarch64-linux-gnu-
export CROSS_COMPILE_loong64=loongarch64-linux-gnu-
export CROSS_COMPILE_riscv32=riscv64-linux-gnu-
export CROSS_COMPILE_riscv64=riscv64-linux-gnu-
This change requires some portions of the Makefile to be rearranged,
to allow for the fact that $(CROSS_COMPILE) may not have been set
until the build directory has been parsed to determine the CPU
architecture.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The seed CSR defined by the Zkr extension is accessible only in M-mode
by default. Older versions of OpenSBI (prior to version 1.4) do not
set mseccfg.sseed, with the result that attempts to access the seed
CSR from S-mode will raise an illegal instruction exception.
Add a facility for testing the accessibility of arbitrary CSRs, and
use it to check that the seed CSR is accessible before reporting the
seed CSR entropy source as being functional.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add basic support for running directly on top of SBI, with no UEFI
firmware present. Build as e.g.:
make CROSS=riscv64-linux-gnu- bin-riscv64/ipxe.sbi
The resulting binary can be tested in QEMU using e.g.:
qemu-system-riscv64 -M virt -cpu max -serial stdio \
-kernel bin-riscv64/ipxe.sbi
No drivers or executable binary formats are supported yet, but the
unit test suite may be run successfully.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Restructure the parsing of the build directory name from
bin[[-<arch>]-<platform>]
to
bin[-<arch>[-<platform>]]
and allow for a per-architecture default build platform.
For the sake of backwards compatibility, handle "bin-efi" as a special
case equivalent to "bin-i386-efi".
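The restructured parsing can be modelled roughly as below. The real
logic lives in the Makefile; this sketch uses a hypothetical function
name, and the default architecture and per-architecture default
platforms shown are assumptions chosen for illustration.

```python
DEFAULT_ARCH = "i386"                 # assumption: used for a bare "bin"
DEFAULT_PLATFORM = {                  # assumption: illustrative defaults only
    "i386": "pcbios",
    "riscv64": "sbi",
}

def parse_bin_dir(name):
    """Split bin[-<arch>[-<platform>]] into an (arch, platform) pair."""
    # Backwards compatibility special case
    if name == "bin-efi":
        name = "bin-i386-efi"
    parts = name.split("-")
    if parts[0] != "bin" or len(parts) > 3:
        raise ValueError(name)
    arch = parts[1] if len(parts) > 1 else DEFAULT_ARCH
    # Fall back to the per-architecture default platform, if any
    platform = parts[2] if len(parts) > 2 else DEFAULT_PLATFORM.get(arch)
    return arch, platform
```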
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The timer and entropy seed CSRs will, by design, return different
values each time they are read.
Add the missing volatile qualifiers on the inline assembly to prevent
gcc from assuming that repeated invocations may be elided.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Zkr entropy source extension defines a potentially unprivileged
seed CSR that can be read to obtain 16 bits of entropy input, with a
mandated requirement that 256 entropy input bits read from the seed
CSR will contain at least 128 bits of min-entropy.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Zicntr extension defines an unprivileged wall-clock time CSR that
roughly matches the behaviour of an invariant TSC on x86. The nominal
frequency of this timer may be read from the "timebase-frequency"
property of the CPU node in the device tree.
Add a timer source using RDTIME to provide implementations of udelay()
and currticks(), modelled on the existing RDTSC-based timer for x86.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RISC-V seems to allow for direct discovery of CPU features only from
M-mode (e.g. by setting up a trap handler and then attempting to
access a CSR), with S-mode code expected to read the resulting
constructed ISA description from the device tree.
Add the ability to check for the presence of named extensions listed
in the "riscv,isa" property of the device tree node corresponding to
the boot hart.
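As a simplified model of the check, the sketch below tests an ISA
string for a named extension. It assumes the common layout of a
"rv32"/"rv64" prefix followed by single-letter extensions, with
multi-letter extensions separated by underscores, and ignores version
suffixes. The function name is hypothetical.

```python
def isa_has_extension(isa, ext):
    """Check whether an ISA string such as "rv64imafdc_zicntr_zkr"
    lists the named extension (simplified, case-insensitive)."""
    base, *multi = isa.lower().split("_")
    if not base.startswith("rv"):
        return False
    # Strip the "rv" prefix and the XLEN digits, leaving single letters
    singles = base[2:].lstrip("0123456789")
    names = set(singles) | set(multi)
    return ext.lower() in names
```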
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for the existence of platforms with no PCI bus by including the
PCI settings mechanism only if PCI bus support is included.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Running with flat physical addressing is a fairly common early boot
environment. Rename UACCESS_EFI to UACCESS_FLAT so that this code may
be reused in non-UEFI boot environments that also use flat physical
addressing.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the ability to issue Supervisor Binary Interface (SBI) calls via
the ECALL instruction, and use the SBI DBCN extension to implement a
debug console.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Montgomery multiplication requires calculating the inverse of the
modulus modulo a larger power of two.
Add bigint_mod_invert() to calculate the inverse of any (odd) big
integer modulo an arbitrary power of two, using a lightly modified
version of the algorithm presented in "A New Algorithm for Inversion
mod p^k" (Koç, 2017).
The power of two is taken to be 2^k, where k is the number of bits
available in the big integer representation of the invertend. The
inverse modulo any smaller power of two may be obtained simply by
masking off the relevant bits in the inverse.
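A bit-serial inversion modulo 2^k can be sketched as below. This is
written in the spirit of the cited algorithm specialised to p = 2, on
Python integers with hypothetical names; it is not claimed to match
the bigint_mod_invert() code.

```python
def mod_invert_pow2(a, k):
    """Return the inverse of odd a modulo 2^k, one bit per iteration.

    Invariant before iteration i: a * inverse + residue * 2^i == 1.
    """
    if a & 1 == 0:
        raise ValueError("only odd values are invertible modulo 2^k")
    inverse = 0
    residue = 1
    for i in range(k):
        bit = residue & 1                    # next bit of the inverse
        inverse |= bit << i
        residue = (residue - a * bit) >> 1   # difference is always even
    return inverse
```

Because bit i of the inverse depends only on the lower bits, masking
the result to any smaller number of bits gives the inverse modulo that
smaller power of two.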
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow scripts to read basic information from USB device descriptors
via the settings mechanism. For example:
echo USB vendor ID: ${usb/${busloc}.8.2}
echo USB device ID: ${usb/${busloc}.10.2}
echo USB manufacturer name: ${usb/${busloc}.14.0}
The general syntax is
usb/<bus:dev>.<offset>.<length>
where bus:dev is the USB bus:device address (as obtained via the
"usbscan" command, or from e.g. ${net0/busloc} for a USB network
device), and <offset> and <length> select the required portion of the
USB device descriptor.
Following the usage of SMBIOS settings tags, a <length> of zero may be
used to indicate that the byte at <offset> contains a USB string
descriptor index, and an <offset> of zero may be used to indicate that
the <length> contains a literal USB string descriptor index.
Since the byte at offset zero can never contain a string index, and a
literal string index can never be zero, the combination of both
<length> and <offset> being zero may be used to indicate that the
entire device descriptor is to be read as a raw hex dump.
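The interpretation rules above can be summarised by the following
sketch, which classifies a setting name without touching any USB
hardware. The function name and the returned labels are hypothetical.

```python
def parse_usb_setting(name):
    """Classify a "usb/<bus:dev>.<offset>.<length>" setting name.

    Returns (busloc, kind, detail), following the rules in the text.
    """
    prefix, _, rest = name.partition("/")
    if prefix != "usb":
        raise ValueError(name)
    busloc, offset, length = rest.rsplit(".", 2)
    offset, length = int(offset), int(length)
    if offset == 0 and length == 0:
        return busloc, "raw descriptor", None
    if length == 0:
        # Byte at <offset> holds a string descriptor index
        return busloc, "string index at offset", offset
    if offset == 0:
        # <length> is itself the string descriptor index
        return busloc, "literal string index", length
    return busloc, "bytes", (offset, length)
```

For example, "usb/1:2.8.2" selects the two bytes at offset 8 of the
device descriptor (the vendor ID), while "usb/1:2.14.0" selects the
string whose index is stored at offset 14 (the manufacturer name).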
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Implement a "usbscan" command as a direct analogy of the existing
"pciscan" command, allowing scripts to iterate over all detected USB
devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Faster modular multiplication algorithms such as Montgomery
multiplication will still require the ability to perform a single
direct modular reduction.
Neaten up the implementation of direct reduction and split it out into
a separate bigint_reduce() function, complete with its own unit tests.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Every architecture uses the same implementation for bigint_is_set(),
and there is no reason to suspect that a future CPU architecture will
provide a more efficient way to implement this operation.
Simplify the code by providing a single architecture-independent
implementation of bigint_is_set().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The big integer shift operations are misleadingly described as
rotations since the original x86 implementations are essentially
trivial loops around the relevant rotate-through-carry instruction.
The overall operation performed is a shift rather than a rotation.
Update the function names and descriptions to reflect this.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
An n-bit multiplication product may be added to up to two n-bit
integers without exceeding the range of a (2n)-bit integer:
(2^n - 1)*(2^n - 1) + (2^n - 1) + (2^n - 1) = 2^(2n) - 1
Exploit this to perform big integer multiplication in constant time
without requiring the caller to provide temporary carry space.
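The consequence of this identity can be seen in a plain schoolbook
multiplication over element lists, sketched below with an assumed
32-bit element width and hypothetical names. The loop structure is
illustrative and not necessarily that of bigint_multiply().

```python
def multiply(a, b, width=32):
    """Multiply two little-endian lists of elements.

    Every inner step adds one product, one existing result element, and
    one carry element, which always fits in a double-width accumulator.
    The loop bounds depend only on the operand sizes, never on values.
    """
    mask = (1 << width) - 1
    result = [0] * (len(a) + len(b))
    for i, a_elem in enumerate(a):
        carry = 0
        for j, b_elem in enumerate(b):
            acc = a_elem * b_elem + result[i + j] + carry
            assert acc < 1 << (2 * width)    # guaranteed by the identity
            result[i + j] = acc & mask
            carry = acc >> width
        # This element has not been written by any earlier row
        result[i + len(b)] = carry
    return result
```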
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for building as a Linux userspace binary for AArch32.
This allows the self-test suite to be more easily run for the 32-bit
ARM code. For example:
make CROSS=arm-linux-gnu- bin-arm32-linux/tests.linux
qemu-arm -L /usr/arm-linux-gnu/sys-root/ \
./bin-arm32-linux/tests.linux
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Reading from PMCCNTR causes an undefined instruction exception when
running in PL0 (e.g. as a Linux userspace binary), unless the
PMUSERENR.EN bit is set.
Restructure profile_timestamp() for 32-bit ARM to perform an
availability check on the first invocation, with subsequent
invocations returning zero if PMCCNTR could not be enabled.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
All consumers of profile_timestamp() currently treat the value as an
unsigned long. Only the elapsed number of ticks is ever relevant: the
absolute value of the timestamp is not used. Profiling is used to
measure short durations that are generally fewer than a million CPU
cycles, for which an unsigned long is easily large enough.
Standardise the return type of profile_timestamp() as unsigned long
across all CPU architectures. This allows 32-bit architectures such
as i386 and riscv32 to omit all logic associated with retrieving the
upper 32 bits of the 64-bit hardware counter, which simplifies the
code and allows riscv32 and riscv64 to share the same implementation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Big integer multiplication currently performs immediate carry
propagation from each step of the long multiplication, relying on the
fact that the overall result has a known maximum value to minimise the
number of carries performed without ever needing to explicitly check
against the result buffer size.
This is not a constant-time algorithm, since the number of carries
performed will be a function of the input values. We could make it
constant-time by always continuing to propagate the carry until
reaching the end of the result buffer, but this would introduce a
large number of redundant zero carries.
Require callers of bigint_multiply() to provide a temporary carry
storage buffer, of the same size as the result buffer. This allows
the carry-out from the accumulation of each double-element product to
be accumulated in the temporary carry space, and then added in via a
single call to bigint_add() after the multiplication is complete.
Since the structure of big integer multiplication is identical across
all current CPU architectures, provide a single shared implementation
of bigint_multiply(). The architecture-specific operation then
becomes the multiplication of two big integer elements and the
accumulation of the double-element product.
Note that any intermediate carry arising from accumulating the lower
half of the double-element product may be added to the upper half of
the double-element product without risk of overflow, since the result
of multiplying two n-bit integers can never have all n bits set in its
upper half. This simplifies the carry calculations for architectures
such as RISC-V and LoongArch64 that do not have a carry flag.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The admin queue API requires us to tell the device how many event
counters we have provided via the "configure device resources" admin
queue command. There is, of course, absolutely no documentation
indicating how many event counters actually need to be provided.
We require only two event counters: one for the transmit queue, one
for the receive queue. (The receive queue doesn't seem to actually
make any use of its event counter, but the "create receive queue"
admin queue command will fail if it doesn't have an available event
counter to choose.)
In the absence of any documentation, we currently make the assumption
that allocating and configuring 16 counters (i.e. one whole cacheline)
will be sufficient to allow for the use of two counters.
This assumption turns out to be incorrect. On larger instance types
(observed with a c3d-standard-16 instance in europe-west4-a), we find
that creating the transmit or receive queues will each fail with a
probability of around 50% with the "failed precondition" error code.
Experimentation suggests that even though the device has accepted our
"configure device resources" command indicating that we are providing
only 16 event counters, it will attempt to choose any of its potential
32 event counters (and will then fail since the event counter that it
unilaterally chose is outside of the agreed range).
Work around this firmware bug by always allocating the maximum number
of event counters supported by the device. (This requires deferring
the allocation of the event counters until after issuing the "describe
device" command.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As of commit 79c0173 ("[build] Create util/genfsimg for building
filesystem-based images"), the EFI boot file name for each CPU
architecture is defined within the genfsimg script itself, rather than
being passed in as a Makefile parameter.
Remove the now-redundant Makefile definitions for EFI_BOOT_FILE.
Reported-by: Christian I. Nilsson <nikize@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for building iPXE as a 64-bit or 32-bit RISC-V binary, for
either UEFI or Linux userspace platforms. For example:
  # RISC-V 64-bit UEFI
  make CROSS=riscv64-linux-gnu- bin-riscv64-efi/ipxe.efi
  # RISC-V 32-bit UEFI
  make CROSS=riscv64-linux-gnu- bin-riscv32-efi/ipxe.efi
  # RISC-V 64-bit Linux
  make CROSS=riscv64-linux-gnu- bin-riscv64-linux/tests.linux
  qemu-riscv64 -L /usr/riscv64-linux-gnu/sys-root \
      ./bin-riscv64-linux/tests.linux
  # RISC-V 32-bit Linux
  make CROSS=riscv64-linux-gnu- SYSROOT=/usr/riscv32-linux-gnu/sys-root \
      bin-riscv32-linux/tests.linux
  qemu-riscv32 -L /usr/riscv32-linux-gnu/sys-root \
      ./bin-riscv32-linux/tests.linux
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The cross-compiler will typically use the appropriate sysroot
directory automatically. This may not work for toolchains where a
single cross-compiler is used to produce output for multiple CPU
variants (e.g. 32-bit and 64-bit RISC-V).
Add a SYSROOT=... parameter that may be used to specify the relevant
sysroot directory, e.g.
  make CROSS=riscv64-linux-gnu- SYSROOT=/usr/riscv32-linux-gnu/sys-root \
      bin-riscv32-linux/tests.linux
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The EDK2 header macros VA_START(), VA_ARG() etc produce build errors
on some CPU architectures (notably on 32-bit RISC-V, which is not yet
supported by EDK2).
Fix by using the standard variable argument list macros.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
For some 32-bit CPUs, we need to provide implementations of 64-bit
shifts as libgcc helper functions. Add test cases to cover these.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
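For context on the message above: on a 32-bit CPU the compiler typically lowers a 64-bit shift to a libgcc helper call (for example __ashldi3 or __lshrdi3). The following Python sketch shows how such a shift can be composed from two 32-bit halves. It illustrates the technique only; the function names are hypothetical, and the real helpers are written in C or assembly.

```python
WORD_BITS = 32
WORD_MASK = 0xFFFFFFFF


def shl64(low, high, shift):
    """Left-shift the 64-bit value (high:low) by 0..63 bits."""
    if shift == 0:
        return low, high
    if shift >= WORD_BITS:
        # Everything in the low word moves into the high word
        return 0, (low << (shift - WORD_BITS)) & WORD_MASK
    new_high = ((high << shift) | (low >> (WORD_BITS - shift))) & WORD_MASK
    new_low = (low << shift) & WORD_MASK
    return new_low, new_high


def shr64(low, high, shift):
    """Logically right-shift the 64-bit value (high:low) by 0..63 bits."""
    if shift == 0:
        return low, high
    if shift >= WORD_BITS:
        # Everything in the high word moves into the low word
        return high >> (shift - WORD_BITS), 0
    new_low = ((low >> shift) | (high << (WORD_BITS - shift))) & WORD_MASK
    new_high = high >> shift
    return new_low, new_high
```

A test suite for these helpers would naturally cover the boundary shift counts (0, 31, 32, and 63), since each falls into a different branch.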
Define a cpu_halt() function which is architecture-specific but
platform-independent, and merge the multiple architecture-specific
implementations of the EFI cpu_nap() function into a single central
efi_cpu_nap() that uses cpu_halt() if applicable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The definitions of the setjmp() and longjmp() functions are common to
all architectures, with only the definition of the jump buffer
structure being architecture-specific.
Move the architecture-specific portions to bits/setjmp.h and provide a
common setjmp.h for the function definitions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the "--retain <N>" option to limit the number of retained old AMI
images (within the same family, architecture, and public visibility).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for easier identification of images and snapshots created by the
aws-import script by adding tags for image family (e.g. "iPXE") and
architecture (e.g. "x86_64") to both.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As described in commit 3b81a4e ("[ena] Provide a host information
page"), we currently report an operating system type of "Linux" in
order to work around broken versions of the ENA firmware that will
fail to create a completion queue if we report the correct operating
system type.
As of September 2024, the ENA team at AWS assures us that the entire
AWS fleet has been upgraded to fix this bug, and that we are now safe
to report the correct operating system type value in the "type" field
of struct ena_host_info.
The ENA team has also clarified that at least some deployed versions
of the ENA firmware still have the defect that requires us to report
an operating system version number of 2 (regardless of operating
system type), and so we continue to report ENA_HOST_INFO_VERSION_WTF
in the "version" field of struct ena_host_info.
Add an explicit warning on the previous known failure path, in case
some deployed versions of the ENA firmware turn out to not have been
upgraded as expected.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Move the <gdbmach.h> file to <bits/gdbmach.h>, and provide a common
dummy implementation for all architectures that have not yet
implemented support for GDB.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Simplify the process of adding a new CPU architecture by providing
common implementations of typically empty architecture-specific header
files.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
This patch adds support for the AQtion Ethernet controller, enabling
iPXE to recognize and utilize the specific models (AQC114, AQC113, and
AQC107).
Tested-by: Animesh Bhatt <animeshb@marvell.com>
Signed-off-by: Animesh Bhatt <animeshb@marvell.com>
The link status check in falcon_xaui_link_ok() reads from the
FCN_XX_CORE_STAT_REG_MAC register only on production hardware (where
the FPGA version reads as zero), but modifies the value and writes
back to this register unconditionally. This triggers an uninitialised
variable warning on newer versions of gcc.
Fix by assuming that the register exists only on production hardware,
and so moving the "modify-write" portion of the "read-modify-write"
operation to also be covered by the same conditional check.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the "imgdecrypt" command that can be used to decrypt a detached
encrypted data image using a cipher key obtained from a separate CMS
envelope image. For example:
  # Create non-detached encrypted CMS messages
  #
  openssl cms -encrypt -binary -aes-256-gcm -recip client.crt \
      -in vmlinuz -outform DER -out vmlinuz.cms
  openssl cms -encrypt -binary -aes-256-gcm -recip client.crt \
      -in initrd.img -outform DER -out initrd.img.cms
  # Detach data from envelopes (using iPXE's contrib/crypto/cmsdetach)
  #
  cmsdetach vmlinuz.cms -d vmlinuz.dat -e vmlinuz.env
  cmsdetach initrd.img.cms -d initrd.img.dat -e initrd.img.env
and then within iPXE:
  #!ipxe
  imgfetch http://192.168.0.1/vmlinuz.dat
  imgfetch http://192.168.0.1/initrd.img.dat
  imgdecrypt vmlinuz.dat http://192.168.0.1/vmlinuz.env
  imgdecrypt initrd.img.dat http://192.168.0.1/initrd.img.env
  boot vmlinuz
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for decrypting images containing detached encrypted data
using a cipher key obtained from a separate CMS envelope image (in DER
or PEM format).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The openssl toolchain does not currently seem to support creating CMS
envelopedData or authEnvelopedData messages with detached encrypted
data.
Add a standalone tool "cmsdetach" that can be used to detach the
encrypted data from a CMS message. For example:
  openssl cms -encrypt -binary -aes-256-gcm -recip client.crt \
      -in bootfile -outform DER -out bootfile.cms
  cmsdetach bootfile.cms --data bootfile.dat --envelope bootfile.env
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Generalise CMS self-test data structure and macro names to refer to
"messages" rather than "signatures", in preparation for adding image
decryption tests.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Some ASN.1 OID-identified algorithms require additional parameters,
such as an initialisation vector for a block cipher. The structure of
the parameters is defined by the individual algorithm.
Extend asn1_algorithm() to allow these additional parameters to be
returned via a separate ASN.1 cursor.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Reduce the number of dynamic allocations required to parse a CMS
message by retaining the ASN.1 cursor returned from image_asn1() for
the lifetime of the CMS message. This allows embedded ASN.1 cursors
to be used for parsed objects within the message, such as embedded
signatures.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Instances of cipher and digest algorithms tend to get called
repeatedly to process substantial amounts of data. This is not true
for public-key algorithms, which tend to get called only once or twice
for a given key.
Simplify the public-key algorithm API so that there is no reusable
algorithm context. In particular, this allows callers to omit the
error handling currently required to handle memory allocation (or key
parsing) errors from pubkey_init(), and to omit the cleanup calls to
pubkey_final().
This change does remove the ability for a caller to distinguish
between a verification failure due to a memory allocation failure and
a verification failure due to a bad signature. This difference is not
material in practice: in both cases, for whatever reason, the caller
was unable to verify the signature and so cannot proceed further, and
the cause of the error will be visible to the user via the return
status code.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The TLS connection structure has grown to become unmanageably large as
new features and support for new TLS protocol versions have been added
over time.
Split out the portions of struct tls_connection that are specific to
client and server operations into separate structures, and simplify
some structure field names.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The TLS connection structure has grown to become unmanageably large as
new features and support for new TLS protocol versions have been added
over time.
Split out the portions of struct tls_connection that are specific to
transmit and receive operations into separate structures, and simplify
some structure field names.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The rom-o-matic code does not form part of the iPXE codebase, has not
been maintained for over a decade, and does not appear to still be in
use anywhere in the world.
It does, however, result in a large number of false positive security
vulnerability reports from some low quality automated code analysis
tools such as Fortify SCA.
Remove this unused and obsolete code to reduce the burden of
responding to these false positives.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Generalise the existing support for performing RSA public-key
encryption, decryption, signature, and verification tests, and update
the code to use okx() for neater reporting of test results.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Asymmetric keys are invariably encountered within ASN.1 structures
such as X.509 certificates, and the various large integers within an
RSA key are themselves encoded using ASN.1.
Simplify all code handling asymmetric keys by passing keys as a single
ASN.1 cursor, rather than separate data and length pointers.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Generalise the logic for identifying the matching PCI root bridge I/O
protocol to allow for identifying the closest matching PCI bus:dev.fn
address range, and use this to provide PCI address range discovery
(while continuing to inhibit automatic PCI bus probing).
This allows the "pciscan" command to work as expected under UEFI.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The UEFI device model requires us to not probe the PCI bus directly,
but instead to wait to be offered the opportunity to drive devices via
our driver service binding handle.
We currently inhibit PCI bus probing by having pci_discover() return
an empty range when using the EFI PCI I/O API. This has the unwanted
side effect that scanning the bus manually using the "pciscan" command
will also fail to discover any devices.
Separate out the concept of being allowed to probe PCI buses from the
mechanism for discovering PCI bus:dev.fn address ranges, so that this
limitation may be removed.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
An attempt to use a validator for an empty certificate chain will
correctly fail the overall validation with the "empty certificate
chain" error propagated from x509_auto_append().
In a debug build, the call to validator_name() will attempt to call
x509_name() on a non-existent certificate, resulting in garbage in the
debug message.
Fix by checking for the special case of an empty certificate chain.
This issue does not affect non-debug builds, since validator_name() is
(as per its description) called only for debug messages.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is some exploitable similarity between the data structures used
for representing CMS signatures and CMS encryption keys. In both
cases, the CMS message fundamentally encodes a list of participants
(either message signers or message recipients), where each participant
has an associated certificate and an opaque octet string representing
the signature or encrypted cipher key. The ASN.1 structures are not
identical, but are sufficiently similar to be worth exploiting: for
example, the SignerIdentifier and RecipientIdentifier data structures
are defined identically.
Rename data structures and functions, and add the concept of a CMS
message type.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Extend the definition of an ASN.1 OID-identified algorithm to include
a potential cipher suite, and add identifiers for AES-CBC and AES-GCM.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The cms_signature() and cms_verify() functions currently accept raw
data pointers. This will not be possible for cms_decrypt(), which
will need the ability to extract fragments of ASN.1 data from a
potentially large image.
Change cms_signature() and cms_verify() to accept an image as an input
parameter, and move the responsibility for setting the image trust
flag within cms_verify() since that now becomes a more natural fit.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow passing a NULL value for the certificate list to all functions
used for identifying an X.509 certificate from an existing set of
certificates, and rename function parameters to indicate that this
certificate list represents an unordered certificate store (rather
than an ordered certificate chain).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Centralise all current mechanisms for identifying an X.509 certificate
(by raw content, by subject, by issuer and serial number, and by
matching public key), and remove the certstore-specific and
CMS-specific variants of these functions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Handling large ASN.1 objects such as encrypted CMS files will require
the ability to use the asn1_enter() and asn1_skip() family of
functions on partial object cursors, where a defined additional length
is known to exist after the end of the data buffer pointed to by the
ASN.1 object cursor.
We already have support for partial object cursors in the underlying
asn1_start() operation used by both asn1_enter() and asn1_skip(), and
this is used by the DER image probe routine to check that the
potential DER file comprises a single ASN.1 SEQUENCE object.
Add asn1_enter_partial() to formalise the process of entering an ASN.1
partial object, and refactor the DER image probe routine to use this
instead of open-coding calls to the underlying asn1_start() operation.
There is no need for an equivalent asn1_skip_partial() function, since
only objects that are wholly contained within the partial cursor may
be successfully skipped.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
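The idea of entering a partial object, as described in the message above, can be sketched as follows. This is a Python illustration under stated assumptions, not the iPXE asn1_enter_partial() code: the function names, the return convention, and the handling of the tag are hypothetical. The key point is that the declared content length is checked against the buffer length plus a known additional length, rather than against the buffer alone.

```python
from typing import Optional, Tuple


def parse_header(data: bytes) -> Optional[Tuple[int, int, int]]:
    """Parse a DER tag and length; return (tag, header_len, content_len)."""
    if len(data) < 2:
        return None
    tag, length = data[0], data[1]
    header_len = 2
    if length & 0x80:
        # Long form: low seven bits give the number of length bytes
        count = length & 0x7F
        if count == 0 or len(data) < 2 + count:
            return None
        length = int.from_bytes(data[2:2 + count], "big")
        header_len += count
    return tag, header_len, length


def enter_partial(data: bytes, extra: int,
                  expected_tag: int) -> Optional[Tuple[bytes, int]]:
    """Enter an object whose content may extend `extra` bytes past `data`.

    Returns (content present in the buffer, content bytes still outside
    the buffer), or None if the object cannot fit.
    """
    header = parse_header(data)
    if header is None:
        return None
    tag, header_len, content_len = header
    if tag != expected_tag:
        return None
    if content_len > (len(data) - header_len) + extra:
        return None
    in_buffer = data[header_len:header_len + content_len]
    return in_buffer, content_len - len(in_buffer)
```

For example, a SEQUENCE declaring 16 content bytes can be entered from a buffer holding only the first 4 of them, provided the caller knows that at least 12 further bytes exist beyond the buffer.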
Calling asn1_skip_if_exists() on a malformed ASN.1 object may
currently leave the cursor in a partially-updated state, where the tag
byte and one of the length bytes have been stripped. The cursor is
left with a valid data pointer and length and so no out-of-bounds
access can arise, but the cursor no longer points to the start of an
ASN.1 object.
Ensure that each ASN.1 cursor manipulation code path leads to the
cursor being either fully updated, left unmodified, or invalidated,
and update the function descriptions to reflect this.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Successfully reaching the end of a well-formed ASN.1 object list is
arguably not an error, but the current code (dating back to the
original ASN.1 commit in 2007) will explicitly check for and report
this as an error condition.
Remove the explicit check for reaching the end of a well-formed ASN.1
object list, and instead return success along with a zero-length (and
hence implicitly invalidated) cursor.
Almost every existing caller of asn1_skip() or asn1_skip_if_exists()
currently ignores the return value anyway. Skipped objects are (by
definition) not of interest to the caller, and the invalidation
behaviour of asn1_skip() ensures that any errors will be safely caught
on a subsequent attempt to actually use the ASN.1 object content.
Since these existing callers ignore the return value, they cannot be
affected by this change.
There is one existing caller of asn1_skip_if_exists() that does check
the return value: in asn1_skip() itself, an error returned from
asn1_skip_if_exists() will cause the cursor to be invalidated. In the
case of an error indicating only that the cursor length is already
zero, invalidation is a no-op, and so this change affects only the
return value propagated from asn1_skip().
This leaves only a single call site within ocsp_request() where the
return value from asn1_skip() is currently checked. The return status
here is moot since there is no way for the code in question to fail
(absent a bug in the ASN.1 construction or parsing code).
There are therefore no callers of asn1_skip() or asn1_skip_if_exists()
that rely on an error being returned for successfully reaching the end
of a well-formed ASN.1 object list. Simplify the code by redefining
this as a successful outcome.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
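The redefined behaviour in the message above can be modelled with a small Python sketch. The function name and the (status, cursor) return convention are assumptions for illustration; in this model an invalidated cursor is represented as a zero-length byte string, so reaching the end of a list and hitting a malformed object differ only in the returned status.

```python
from typing import Tuple


def skip(cursor: bytes) -> Tuple[bool, bytes]:
    """Skip one DER object; return (success, updated cursor)."""
    if not cursor:
        # End of a well-formed object list: success, zero-length cursor
        return True, b""
    if len(cursor) < 2:
        return False, b""               # malformed: invalidate
    length = cursor[1]
    offset = 2
    if length & 0x80:
        # Long form: low seven bits give the number of length bytes
        count = length & 0x7F
        if count == 0 or len(cursor) < offset + count:
            return False, b""
        length = int.from_bytes(cursor[offset:offset + count], "big")
        offset += count
    if len(cursor) - offset < length:
        return False, b""               # content overruns the cursor
    # Fully updated: cursor now points at the start of the next object
    return True, cursor[offset + length:]
```

Note that every path leaves the cursor either fully advanced or invalidated; there is no path that strips only part of an object header.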
Redefine bit 30 of a CPUID numerical setting to be part of the
function number, in order to allow access to hypervisor CPUID leaves.
This technically breaks backwards compatibility with scripts
attempting to read more than 64 consecutive functions. Since there is
no meaningful block of 64 consecutive related functions, it is
vanishingly unlikely that this capability has ever been used.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Hypervisors typically intercept CPUID leaves in the range 0x40000000
to 0x400000ff, with leaf 0x40000000 returning the maximum supported
function within this range in register %eax.
iPXE currently masks off bit 30 from the requested CPUID leaf when
checking to see if a function is supported, which causes this check to
read from leaf 0x00000000 instead of 0x40000000.
Fix by including bit 30 within the mask.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
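The effect of the mask change in the message above can be illustrated with a short Python sketch. The constant names and the dictionary standing in for the CPUID instruction are assumptions; the example values for the maximum supported functions are made up.

```python
CPUID_EXTENDED = 0x80000000     # bit 31 (assumed constant name)
CPUID_HYPERVISOR = 0x40000000   # bit 30 (assumed constant name)

OLD_MASK = CPUID_EXTENDED
NEW_MASK = CPUID_EXTENDED | CPUID_HYPERVISOR

# Stand-in for executing CPUID: maps a base leaf to the maximum supported
# function that it reports in %eax (illustrative values only)
FAKE_MAX_FUNCTION = {
    0x00000000: 0x00000016,
    0x40000000: 0x40000001,
    0x80000000: 0x80000008,
}


def is_supported(function, mask):
    """Check a function against the maximum reported by its base leaf."""
    base_leaf = function & mask
    return function <= FAKE_MAX_FUNCTION[base_leaf]
```

With the old mask, function 0x40000001 is compared against the maximum reported by leaf 0x00000000 and is therefore rejected; with bit 30 included in the mask, it is compared against leaf 0x40000000 as intended.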
The general syntax for SMBIOS settings:
  smbios/<instance>.<type>.<offset>.<length>
is currently extended such that a <length> of zero indicates that the
byte at <offset> contains a string index, and an <offset> of zero
indicates that the <length> contains a literal string index.
Since the byte at offset zero can never contain a string index, and a
literal string index can never have a zero value, the combination of
both <length> and <offset> being zero is currently invalid and will
always return "not found".
Extend the syntax such that the combination of both <length> and
<offset> being zero may be used to read the entire data structure.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
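The four cases described in the message above can be summarised in a small Python sketch. The function name and the returned labels are hypothetical, and for simplicity the sketch requires all four dotted fields to be present as decimal numbers.

```python
def classify(name):
    """Classify an SMBIOS setting name by its <offset> and <length>."""
    prefix, _, rest = name.partition("/")
    if prefix != "smbios":
        raise ValueError("not an SMBIOS setting name")
    instance, type_, offset, length = (int(part) for part in rest.split("."))
    if offset == 0 and length == 0:
        # Previously invalid combination: now reads the whole structure
        kind = "entire structure"
    elif length == 0:
        kind = "string indexed by byte at offset %d" % offset
    elif offset == 0:
        kind = "string with literal index %d" % length
    else:
        kind = "%d raw byte(s) at offset %d" % (length, offset)
    return instance, type_, kind
```

For example, `classify("smbios/0.1.0.0")` selects the entire first structure of type 1, while `classify("smbios/0.1.5.0")` selects the string whose index is stored at offset 5.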
Following the example of aws-int13con, add a utility that can be used
to read the INT13 console log from a used iPXE boot disk in Google
Compute Engine.
There seems to be no easy way to directly read the contents of either
a disk image or a snapshot in Google Cloud. Work around this
limitation by creating a snapshot and attaching this snapshot as a
data disk to a temporary Linux instance, which is then used to echo
the INT13 console log to the serial port.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Experiments suggest that using fewer than 64 receive buffers leads to
excessive packet drop rates on some instance types (observed with a
c3-standard-4 instance in europe-west4-a).
Fix by increasing the number of receive data buffers (and adjusting
the length of the registrable queue page address list to match).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Google Virtual Ethernet NIC (GVE or gVNIC) is found only in Google
Cloud instances. There is essentially zero documentation available
beyond the mostly uncommented source code in the Linux kernel.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Following the example of aws-import, add a utility that can be used to
upload an iPXE disk image to Google Compute Engine as a bootable
image. For example:
  make CONFIG=cloud EMBED=config/cloud/gce.ipxe \
      bin-x86_64-pcbios/ipxe.usb bin-x86_64-efi/ipxe.usb
  make CONFIG=cloud EMBED=config/cloud/gce.ipxe \
      CROSS=aarch64-linux-gnu- bin-arm64-efi/ipxe.usb
  ../contrib/cloud/gce-import -p \
      bin-x86_64-pcbios/ipxe.usb \
      bin-x86_64-efi/ipxe.usb \
      bin-arm64-efi/ipxe.usb
The iPXE disk image is automatically wrapped into a tarball containing
a single file named "disk.raw", uploaded to a temporary bucket in
Google Cloud Storage, and used to create a bootable image. The
temporary bucket is deleted after use.
An appropriate image family name is identified automatically: "ipxe"
for BIOS images, "ipxe-uefi-x86-64" for x86_64 UEFI images, and
"ipxe-uefi-arm64" for AArch64 UEFI images. This allows the latest
image within each family to be launched without needing to know the
precise image name.
Google Compute Engine images are globally scoped and are available
(and cached upon first use) in all regions. The initial placement of
the image may be controlled indirectly by using the "--location"
option to specify the Google Cloud Storage location used for the
temporary upload bucket: the image will then be created in the closest
multi-region to the storage location.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The DHCPv6 protocol does not itself provide a router address or a
prefix length. This information is instead obtained from the router
advertisements.
Our IPv6 minirouting table construction logic will first construct an
entry for each advertised prefix, and later update the entry to
include an address assigned within that prefix via stateful DHCPv6 (if
applicable).
This logic fails if the address assigned via stateful DHCPv6 does not
fall within any of the advertised prefixes (e.g. if the network is
configured to use DHCPv6-assigned /128 addresses with no advertised
on-link prefixes). We will currently treat this situation as
equivalent to having a manually assigned address with no corresponding
router address or prefix length: the routing table entry will use the
default /64 prefix length and will not include the router address.
DHCPv6 is triggered only in response to a router advertisement with
the "Managed Address Configuration (M)" or "Other Configuration (O)"
flags set, and a router address is therefore available at the point
that we initiate DHCPv6.
Record the router address when initiating DHCPv6, and expose this
router address as part of the DHCPv6 settings block. This allows the
routing table entry for any address assigned via stateful DHCPv6 to
correctly include the router address, even if the assigned address
does not fall within an advertised prefix.
Also provide a fixed /128 prefix length as part of the DHCPv6 settings
block. When an address assigned via stateful DHCPv6 does not fall
within an advertised prefix, this will cause the routing table entry
to have a /128 prefix length as expected. (When such an address does
fall within an advertised prefix, it will continue to use the
advertised prefix length.)
Originally-fixed-by: Guvenc Gulce <guevenc.guelce@sap.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
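The resulting choice of prefix length for an address assigned via stateful DHCPv6, as described in the message above, can be sketched in Python. The function name and the tuple it returns are assumptions; the sketch models only the outcome, not the iPXE settings or minirouting table code.

```python
import ipaddress


def route_for(address, advertised_prefixes, router):
    """Return (prefix length, router) for a DHCPv6-assigned address."""
    addr = ipaddress.IPv6Address(address)
    for prefix in advertised_prefixes:
        network = ipaddress.IPv6Network(prefix)
        if addr in network:
            # Address lies within an advertised prefix: keep its length
            return network.prefixlen, router
    # No matching advertised prefix: fall back to the fixed /128 from the
    # DHCPv6 settings block, still using the recorded router address
    return 128, router
```

In both cases the router address recorded when DHCPv6 was initiated is included, which is the part that was previously missing for addresses outside any advertised prefix.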
In a small subnet (with a /31 or /32 subnet mask), all addresses
within the subnet are valid host addresses: there is no separate
network address or directed broadcast address.
The logic used in iPXE to determine whether or not to use a link-layer
broadcast address will currently fail in these subnets. In a /31
subnet, the higher of the two host addresses (i.e. the address with
all host bits set) will be treated as a broadcast address. In a /32
subnet, the single valid host address will be treated as a broadcast
address.
Fix by adding the concept of a host mask, defined such that an address
in the local subnet with all of the mask bits set to zero represents
the network address, and an address in the local subnet with all of
the mask bits set to one represents the directed broadcast address.
For most subnets, this is simply the inverse of the subnet mask. For
small subnets (/31 or /32) we can obtain the desired behaviour by
setting the host mask to all ones, so that only the local broadcast
address 255.255.255.255 will be treated as a broadcast address.
Originally-fixed-by: Lukas Stockner <lstockner@genesiscloud.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
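The host mask rule described in the message above can be sketched in Python as follows. The function names are hypothetical and addresses are handled as plain 32-bit integers; this is an illustration of the rule, not the iPXE IPv4 code.

```python
import ipaddress

ALL_ONES = 0xFFFFFFFF


def ip(text):
    """Convert dotted-quad text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


def host_mask(netmask):
    """Inverse of the subnet mask, or all ones for /31 and /32 subnets."""
    inverse = ~netmask & ALL_ONES
    # A /31 leaves one host bit (inverse == 1), a /32 leaves none (== 0)
    return ALL_ONES if inverse <= 1 else inverse


def is_broadcast(dest, local, netmask):
    """Decide whether dest should use the link-layer broadcast address."""
    if dest == ALL_ONES:
        return True                     # local broadcast 255.255.255.255
    in_subnet = (dest & netmask) == (local & netmask)
    mask = host_mask(netmask)
    # Directed broadcast: in the local subnet with all host mask bits set
    return in_subnet and (dest & mask) == mask
```

With an all-ones host mask, the directed broadcast test can succeed only for 255.255.255.255 itself, so neither host in a /31 nor the single host in a /32 is mistaken for a broadcast address.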
Remove the now-unused generalised text widget user interface, along
with the associated concept of a widget set and the implementation of
a read-only label widget.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Rewrite the code implementing the "login" user interface to use a
predefined interactive form. The command "login" then becomes roughly
equivalent to:
  #!ipxe
  form
  item username Username
  item --secret password Password
  present
with the result that login form customisations (e.g. to add a Windows
domain name) may be implemented within the scripting language.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for presenting a dynamic user interface as an interactive
form, alongside the existing support for presenting a dynamic user
interface as a menu.
An interactive form may be used to allow a user to input (or edit)
values for multiple settings on a single screen, as a user-friendly
alternative to prompting for setting values via the "read" command.
In the present implementation, all input fields must fit on a single
screen (with no scrolling), and the only supported widget type is an
editable text box.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
For interactive forms, the concept of a secret value becomes
meaningful (e.g. for password fields).
Add a flag to indicate that an item represents a secret value, and
allow this flag to be set via the "--secret" option of the "item"
command.
This flag has no meaning for menu items, but is silently accepted
anyway to keep the code size minimal.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Generalise the ability to look up a dynamic user interface item by
index or by shortcut key, to allow for reuse of this code for
interactive forms.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently have an abstract model of a dynamic menu as a list of
items, each of which has a name, a description, and assorted metadata
such as a shortcut key. The "menu" and "item" commands construct
representations in this abstract model, and the "choose" command then
presents the items as a single-choice menu, with the selected item's
name used as the output value.
This same abstraction may be used to model a dynamic form as a list of
editable items, each of which has a corresponding setting name, an
optional description label, and assorted metadata such as a shortcut
key. By defining a "form" command as an alias for the "menu" command,
we could construct and present forms using commands such as:
  #!ipxe
  form Login to ${url}
  item username Username or email address
  item --secret password Password
  present
or
  #!ipxe
  form Configure IPv4 networking for ${netX/ifname}
  item netX/ip IPv4 address
  item netX/netmask Subnet mask
  item netX/gateway Gateway address
  item netX/dns DNS server address
  present
Reusing the same abstract model for both menus and forms allows us to
minimise the increase in code size, since the implementation of the
"form" and "item" commands is essentially zero-cost.
Rename everything within the abstract data model from "menu" to
"dynamic user interface" to reflect this generalisation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
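The shared abstract model described in the message above might look roughly like this in Python. All class, field, and function names are hypothetical, and the `answers` dictionary stands in for interactive user input; the point is only that a menu and a form can be two presentations of the same list of items.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Item:
    name: str                       # menu: output value; form: setting name
    text: str                       # description shown to the user
    secret: bool = False            # meaningful only for forms
    shortcut: Optional[str] = None  # optional shortcut key


@dataclass
class DynamicUI:
    title: str
    items: List[Item] = field(default_factory=list)


def choose(ui: DynamicUI, index: int) -> str:
    """Menu presentation: the selected item's name is the result."""
    return ui.items[index].name


def present(ui: DynamicUI, answers: Dict[str, str]) -> Dict[str, str]:
    """Form presentation: each item's name identifies a setting to fill."""
    return {item.name: answers[item.name] for item in ui.items}
```

Because both presentations consume the same structure, the commands that build it need no knowledge of which presentation will eventually be used.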
Add support for wraparound scrolling and allow the tab key to be used
to move forward through a list of elements, wrapping back around to
the beginning of the list on overflow.
This is mildly useful for a menu, and likely to be a strong user
expectation for an interactive form.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
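Wraparound movement of the kind described in the message above is commonly implemented with modular arithmetic. A minimal Python sketch, with a hypothetical function name:

```python
def move(index, delta, count):
    """Move through a list of `count` elements, wrapping at either end."""
    return (index + delta) % count
```

Pressing tab on the last element corresponds to `move(count - 1, 1, count)`, which wraps back to index 0.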
Switch terminology for the "item" command from "item <label> <text>"
to "item <name> <text>", in preparation for repurposing the "item"
command to cover interactive forms as well as menus.
Since this renaming affects only a positional parameter, it does not
break compatibility with any existing scripts.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The msg() and alert() functions currently defined in settings_ui.c
provide a general-purpose facility for printing messages centred on
the screen.
Split this out to a separate file to allow for reuse by the form
presentation code.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The curses concept of a window has been supported but never actively
used in iPXE since the mucurses library was first implemented in 2006.
Simplify the code by removing the ability to place a widget set in a
specified window, and instead use the standard screen for all drawing
operations.
This simplification allows the widget set parameter to be omitted for
the draw_widget() and edit_widget() operations, since the only reason
for its inclusion was to provide access to the specified window.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Create a generic abstraction of a text widget, refactor the existing
editable text box widget to use this abstraction, add an
implementation of a non-editable text label widget, and generalise the
login user interface to use this generic widget abstraction.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
2024-05-15 14:22:01 +01:00
846 changed files with 44080 additions and 14553 deletions
* LoongArch64-specific sanboot API implementations
*
*/
FILE_LICENCE(GPL2_OR_LATER_OR_UBDL);
#endif /* _BITS_SANBOOT_H */
Some files were not shown because too many files have changed in this diff