Commit: https://github.com/starlingx-staging/stx-nova/commit/71acfeae0d1c59fdc77704527d763bd85a276f9a

Diff stats: 285 changed files with 21,364 additions and 2,334 deletions.

The single diff is made up of ~130 individual patches.

Note: I'm not including trivial things like files changed due to building/packaging the code, or licensing. I have also excluded a few bug-fix changes which have been, or are already being, worked upstream. For other things they have fixed that were never reported upstream, I am reporting bugs.

Each change below is broken out into: Change, Scenario, Impacts, Upstream Equivalent, Recommendation.

Change: CPU scaling. This makes up quite a few changes throughout nova, but it is basically just live resize. Example: a user creates a server with 4 vCPUs (which is what is tracked against their quota), then resizes it to 2 vCPUs but still "reserves" the other 2 "offline" vCPUs on the host so they can resize back up to 4 without hitting resource claim failures later.
Scenario: Pet VMs with unpredictable load which do not scale well horizontally need to be able to resize up and down at will.
Impacts: Lots of changes; complex; high risk.
Upstream Equivalent: https://review.openstack.org/#/c/141219/ (note this does not support resizing down).
Recommendation: Support the proposed live resize spec if we want this.

Change: Fix how volume-backed instances' root_gb usage is tracked.
Scenario: Very small edge nodes where resources are limited.
Impacts: Known issue upstream for a long time; finally fixed in Rocky when using Placement.
Upstream Equivalent: https://review.openstack.org/#/q/topic:bug/1469179+status:merged
Recommendation: Already fixed upstream.

Change: Support pinned and shared CPUs on the same host.
Scenario: Very small edge nodes with 1 or 2 compute hosts where host aggregates for separation are impractical.
Impacts: Complexity in resource tracking and hardware calculations during scheduling.
Upstream Equivalent: None; operators use host aggregates to separate hosts/flavors for pinned CPUs from hosts with CPU overcommit (traditional data center or public cloud). Jay Pipes does have a related spec though: https://review.openstack.org/#/c/555081/
Recommendation: Support the related spec if we want this.

Change: API changes based on a custom "wrs-header" header which makes some extensions return additional data. Seen in: os-hypervisors, os-server-groups, servers. Shows things like server NUMA topology, PCI devices, hypervisor NUMA node and L3 CAT stats, custom server group metadata, and more detailed port information.
Scenario: Depends on the feature.
Impacts: Interoperability issues, but only if the wrs-header is provided.
Upstream Equivalent: n/a
Recommendation: Some changes might be worth doing upstream if we need them, e.g. showing the NUMA topology or PCI devices for a server.

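For illustration, a minimal sketch of how a client would opt in to the fork's extended responses; the endpoint, token handling and header value here are assumptions, since only the header's name is known from the diff:

```python
import requests

# Minimal sketch, not the fork's client code: the fork returns extra
# data only when the custom header is present. Endpoint, token and
# header value are placeholder assumptions.
NOVA_ENDPOINT = "http://controller:8774/v2.1"
TOKEN = "..."  # a valid keystone token

resp = requests.get(
    NOVA_ENDPOINT + "/servers/detail",
    headers={
        "X-Auth-Token": TOKEN,
        "wrs-header": "true",  # assumed value; presence is what opts in
    },
)
for server in resp.json()["servers"]:
    # With the header, the fork adds fields like NUMA topology and PCI
    # devices to each server dict; without it, the response is stock nova.
    print(server["id"])
```
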
Change: Flavor extra spec validation.
Scenario: Usability.
Impacts: None.
Upstream Equivalent: This was discussed at the Rocky summit when thinking about ways to improve usability for advanced features, since it is easy to mess up flavor extra specs given there is no API schema for them.
Recommendation: Do something similar upstream if we want this.

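As a rough idea of what such validation buys, here is a minimal sketch using jsonschema with a hand-written schema; the extra spec keys and values are real nova ones, but the schema itself is illustrative, not the fork's implementation:

```python
import jsonschema

# Hand-rolled schema for a couple of real nova extra spec namespaces;
# unknown keys pass through, matching how loose extra specs are today.
EXTRA_SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "hw:cpu_policy": {"enum": ["shared", "dedicated"]},
        "hw:mem_page_size": {
            "pattern": "^(small|large|any|\\d+([kKmMgG][bB]?)?)$"
        },
    },
    "additionalProperties": True,
}

def validate_extra_specs(extra_specs):
    """Reject typoed values like hw:cpu_policy=dedicate before they
    silently break scheduling."""
    jsonschema.validate(extra_specs, EXTRA_SPEC_SCHEMA)

validate_extra_specs({"hw:cpu_policy": "dedicated"})    # passes
# validate_extra_specs({"hw:cpu_policy": "dedicate"})   # ValidationError
```
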
Change: Per-instance live migration timeout.
Scenario: Maintenance.
Impacts: None.
Upstream Equivalent: Previously approved spec that was not implemented: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/live-migration-per-instance-timeout.html
Recommendation: Re-propose previously approved specs if we want this.

Change: Per-instance live migration max downtime. Similar to the per-instance live migration timeout, this just allows overriding the per-compute config for live migration downtime.
Scenario: Maintenance.
Impacts: None.
Upstream Equivalent: n/a
Recommendation: Propose a spec if we want this.

Change: Per-instance live migration auto_converge and tunneled flags based on flavor extra spec/image metadata/server metadata.
Scenario: Maintenance.
Impacts: None.
Upstream Equivalent: n/a; seems weird to put this in the flavor or image when it seems it should be part of the live migration server action API instead. Maybe it's only used with host aggregates, depending on how certain hosts are configured for live migration (to support auto-converge or tunneling?).
Recommendation: Propose a spec if we want this.

Change: Some sort of special "recovery_priority" metadata, presumably used by their Virtual Infrastructure Manager (VIM) tooling.
Scenario: Availability.
Impacts: None.
Upstream Equivalent: Masakari project? https://docs.openstack.org/masakari/latest/
Recommendation: n/a

Change: Block create/update/delete API operations during upgrade based on whether or not [upgrade_levels]/compute is set in the API for compatibility mode with older nova-compute services (during rolling upgrade).
Scenario: Maintenance.
Impacts: The API is basically read-only during upgrade, which could mean long downtime in a large deployment.
Upstream Equivalent: The compute API should be natively aware of rolling upgrades and handle back-level computes. This includes scenarios such as not being able to use new features during server create if there are no computes that support the feature yet (we can also handle this with a scheduler filter). For existing servers, the API can detect if the compute host on which the server is running is too old to support some operation and fail with a 409 response (like volume multiattach or virtual device tags).
Recommendation: No action; the compute API should natively handle upgrades. Report bugs if it does not.

Change: Limit the number of attached ports per instance to 16.
Scenario: NFV.
Impacts: n/a
Upstream Equivalent: None.
Recommendation: None; looks like this was added as a hard-coded cap based on performance testing.

Change: Allow passing vif_model when attaching a port to a server.
Scenario: NFV.
Impacts: Increased complexity in the compute API.
Upstream Equivalent: Upstream spec: https://review.openstack.org/#/c/362287/ - create the port directly in neutron with the desired vif_model and then provide the port to nova.
Recommendation: Do not add more proxy API complexity to nova.

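A sketch of that upstream-recommended flow with openstacksdk follows; the cloud name and UUIDs are placeholders, and binding_vnic_type is just one example of a port-level property (the guest NIC model is normally driven by the image's hw_vif_model property rather than the port):

```python
import openstack

# Create the port in neutron with the properties you want, then hand
# the port to nova, rather than proxying vif_model through the compute
# API. Cloud name and UUIDs are placeholders.
conn = openstack.connect(cloud="mycloud")

port = conn.network.create_port(
    network_id="NETWORK_UUID",
    binding_vnic_type="direct",  # e.g. SR-IOV; "normal" is the default
)

server = conn.compute.create_server(
    name="vm1",
    flavor_id="FLAVOR_UUID",
    image_id="IMAGE_UUID",
    networks=[{"port": port.id}],  # pre-created port, no proxy API needed
)
```
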
Change: During port attach, validate that the port's provider:physical_network matches the provider physnet of the compute host (based on host aggregate metadata used to group hosts by physical network).
Scenario: Usability.
Impacts: Increased complexity in the compute API and deployment requirements for modeling with host aggregates.
Upstream Equivalent: None.
Recommendation: Seems like a good idea but would require more thought on alternatives.

Change: Disallow creating/updating/removing extra specs on a flavor that is in use (more than one instance is using the flavor).
Scenario: Usability?
Impacts: Potentially a lot of semi-duplicate or orphaned flavors.
Upstream Equivalent: Flavors are persisted per server, so the flavor that was used to create a server (or resize it) is stored with the server, meaning changes to the original flavor's extra specs should not matter as much.
Recommendation: Seems unnecessary.

Change: Allow specifying hw:cpu_model in flavor extra specs.
Scenario: NFV/HPC.
Impacts: Interoperability and scheduling complexity.
Upstream Equivalent: Spec was proposed upstream and rejected: https://review.openstack.org/#/c/168982/. Could potentially be replicated with CPU traits: https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/report-cpu-features-as-traits.html
Recommendation: None; review the spec if interested.

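With that Rocky spec, a flavor can require specific CPU features via placement traits instead of dictating an exact CPU model. A minimal sketch with python-novaclient, assuming valid credentials; the auth values and flavor name are placeholders:

```python
from keystoneauth1 import loading
from keystoneauth1 import session as ks_session
from novaclient import client as nova_client

# Standard keystoneauth session setup; all auth values are placeholders.
loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="http://controller:5000/v3",
    username="admin", password="secret", project_name="admin",
    user_domain_id="default", project_domain_id="default",
)
sess = ks_session.Session(auth=auth)
nova = nova_client.Client("2.1", session=sess)

# Require a CPU feature via a placement trait rather than pinning an
# exact hw:cpu_model; HW_CPU_X86_AVX2 is a real os-traits name.
flavor = nova.flavors.find(name="nfv.large")  # placeholder flavor
flavor.set_keys({"trait:HW_CPU_X86_AVX2": "required"})
```
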
Change: Ability to configure flavors to partition the host CPU L3 cache.
Scenario: NFV/HPC.
Impacts: Extremely complicated resource tracking which is biased toward Intel CAT technology. Depends on newer libvirt.
Upstream Equivalent: Intel proposed spec: https://review.openstack.org/#/c/568678/
Recommendation: Avoid.

Change: Server group messaging. Creates a channel between the guests and their host in a server group so that guests can send messages to each other. Compute would cast the messages to conductor, which would then broadcast the message to all servers in the group.
Scenario: n/a?
Impacts: Additional complexity in nova for something that should live outside of nova. Potentially a security concern.
Upstream Equivalent: None. Chris Friesen from WindRiver said it turned out to be much more of a burden to maintain than it was worth, and they are likely going to remove it.
Recommendation: Avoid.

Change: The os-instance-actions API was modified to show failure event details, i.e. the exception value (the message).
Scenario: Serviceability?
Impacts: Interoperability. Would require a microversion upstream.
Upstream Equivalent: The os-instance-actions API already shows the traceback for failure events to admin users, so the exception value is redundant unless it is shown to non-admin users.
Recommendation: Ignore.

Change: Added a GET /os-quota-sets admin API to list all modified quotas across all projects and users.
Scenario: Administration.
Impacts: Interoperability. Would require a microversion upstream.
Upstream Equivalent: None.
Recommendation: Propose a spec if we want this.

Change: Forked patch for bug https://bugs.launchpad.net/nova/+bug/1552622 (race during quota update).
Scenario: Usability.
Impacts: None.
Upstream Equivalent: The upstream patch was abandoned.
Recommendation: Restore the patch if we want this fixed upstream.

Change: Server group API additions for (1) limiting the number of servers that can be in a group and (2) a "best effort" key which predates the soft-(anti-)affinity weighers.
Scenario: NFV.
Impacts: None; changes are based on the previously unused server group metadata entries.
Upstream Equivalent: The soft-(anti-)affinity policies in microversion 2.15 cover the "best effort" key, and the "server_group_members" quota limits the number of servers per group (granted, it's global and not possible to specify per group, but you could allow per-tenant overrides?).
Recommendation: Propose a spec if we want this.

Change: wrs-providernet API extension: an API which shows PF and VF inventory and usage from compute nodes configured/tagged for a given provider:physical_network.
Scenario: NFV/Administration/Usability.
Impacts: None; very deployment specific.
Upstream Equivalent: None.
Recommendation: Propose a spec if we want this.

Change: When listing servers with details, do not show security group information, since that requires getting ports for each server from neutron, which puts a lot of load on neutron when listing 1000 servers.
Scenario: Performance.
Impacts: None.
Upstream Equivalent: There is a bug for this upstream: https://bugs.launchpad.net/nova/+bug/1567655
Recommendation: Fix the bug upstream, potentially using a caching mechanism like the instance network info_cache. I know this has also been reported by the internal ECS performance team. This would likely require a blueprint since it's not trivial.

Change: os-server-groups API: allow an admin to create a server group for other tenants.
Scenario: n/a
Impacts: Interoperability.
Upstream Equivalent: None.
Recommendation: Propose a spec if we want this.

Change: Added a POST /os-services API to create a nova-compute service on a given host.
Scenario: n/a
Impacts: Interoperability.
Upstream Equivalent: Restart the nova-compute process on the host to re-create the service in the database if it's gone. If the admin wants the compute to start auto-disabled for testing, they can use the "enable_new_services" config option.
Recommendation: Avoid; not really sure why this was added or needed.

Change: The server action "suspend" and "resume" APIs were changed to actually call the "pause" and "unpause" compute APIs.
Scenario: n/a
Impacts: Interoperability.
Upstream Equivalent: None.
Recommendation: Ignore.

Change: Added a wrs-pci API extension used to show PCI device configuration and usage on compute hosts.
Scenario: NFV/Administration/Usability.
Impacts: Interoperability.
Upstream Equivalent: There was an unused os-pci API in nova from Juno, but since it was never exposed it was removed: http://lists.openstack.org/pipermail/openstack-operators/2017-March/012970.html
Recommendation: Propose a spec if we want this.

Change: Specify "wrs-header" when showing/listing servers with details to get the name/uuid of any group the server is in.
Scenario: NFV/Usability.
Impacts: Interoperability; likely poor performance when listing 1000 servers with details.
Upstream Equivalent: None.
Recommendation: Propose a spec if we want this. We could store the server group information with the instance in the instance_extra table, or add a "member" query filter to the GET /os-server-groups API to filter groups returned by server members.

Change: Added a nova-manage db purge_deleted_instances CLI.
Scenario: Administration.
Impacts: None.
Upstream Equivalent: A nova-manage db purge command was added upstream in Rocky: https://blueprints.launchpad.net/nova/+spec/purge-db
Recommendation: Already done upstream.

Change: Allow creating a server with a specific vif_model, which sets a scheduler hint for non-physical (SRIOV) interfaces that is used to find an appropriate host in a provider network aggregate.
Scenario: NFV.
Impacts: Complex; interoperability.
Upstream Equivalent: None; create the port with the specific vif_model and then provide that port to nova. There is admittedly a gap in network-aware scheduling in nova, which is tracked with this older spec: https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/prep-for-network-aware-scheduling-pike.html
Recommendation: Revive and work on the network-aware scheduling spec if we want to close the usability gap on scheduling to specific networks. Also has overlap with neutron routed networks.

Change: The API does not allow deleting a server that is being resized (cold migrated). Likely due to their live resize / CPU scaling addition.
Scenario: n/a
Impacts: Interoperability.
Upstream Equivalent: None.
Recommendation: Ignore.

Change: Does not support the "force" flag in the live migrate API, so the request always goes through the scheduler.
Scenario: Administration.
Impacts: Interoperability.
Upstream Equivalent: None.
Recommendation: Not a bad idea; forcing a host and bypassing the scheduler during live migration can cause problems with resource tracking.

Change: Some robustness enhancements for things like retrying DB updates when nova-conductor is under stress.
Scenario: Serviceability / Edge?
Impacts: None.
Upstream Equivalent: Operational best practices to adequately scale out nova-conductor and add load balancers.
Recommendation: Ignore.

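For reference, a minimal sketch of the same retry idea using machinery nova already depends on; the decorated function is illustrative, not the fork's code:

```python
from oslo_db import api as oslo_db_api

# oslo.db's wrap_db_retry decorator retries a DB API call on deadlocks
# and disconnects with incremental backoff, which covers most of the
# "conductor under stress" failure modes without custom retry loops.
@oslo_db_api.wrap_db_retry(max_retries=5,
                           retry_on_deadlock=True,
                           retry_on_disconnect=True)
def instance_update(context, instance_uuid, values):
    # The usual SQLAlchemy update would go here.
    pass
```
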
Change: Added support for cold migrate / resize with an ephemeral root disk on LVM.
Scenario: n/a
Impacts: None.
Upstream Equivalent: Change was proposed upstream: https://review.openstack.org/#/c/337334/
Recommendation: Restore and fix the change upstream.

Change: Support live migration of instances with pinned CPUs.
Scenario: NFV/HPC/Maintenance.
Impacts: Complex; implemented separately from how this will be done upstream with Placement.
Upstream Equivalent: Cherry-picked from upstream work: https://review.openstack.org/#/c/244489/
Recommendation: Supposed to be fixed upstream as part of https://blueprints.launchpad.net/nova/+spec/numa-aware-live-migration

Change: Added a periodic task to clean up _del/_res files from a host that was evacuated while there were pending resizes on it.
Scenario: Administration/Maintenance.
Impacts: Low.
Upstream Equivalent: None.
Recommendation: Report a bug upstream if we have the same issue.

Change: Added a periodic task to destroy guests running on the hypervisor which are not found in the nova database (maybe the user deleted the VM while the host was down?).
Scenario: Maintenance.
Impacts: Low.
Upstream Equivalent: None? I thought we already had something in nova-compute that did this (_cleanup_running_deleted_instances). See https://bugs.launchpad.net/nova/+bug/1285000.
Recommendation: Report a bug upstream if we have the same issue.

Change: Lots of additional debug logging built into the compute ResourceTracker.
Scenario: Serviceability.
Impacts: None.
Upstream Equivalent: None.
Recommendation: Could be upstreamed.

Change: The RPC call timeout for post_live_migration_at_destination was changed to 120 seconds to account for servers with up to 16 attached ports.
Scenario: NFV/Maintenance/Scale.
Impacts: None.
Upstream Equivalent: Could be implemented using the new "long_rpc_timeout" config option in Rocky like we use in pre_live_migration.
Recommendation: Proposed upstream: https://review.openstack.org/#/c/588668/

Change: Live migrations for servers in a strict anti-affinity group are serialized to avoid races where we could (incorrectly) live migrate those servers to the same host. Also adds request_spec.group.hosts for in-progress live migrations for other members of the same anti-affinity group (similar for other move operations).
Scenario: Maintenance/Hardening.
Impacts: None.
Upstream Equivalent: None; we don't perform any late anti-affinity checks on the compute during live migration like we do for server create and evacuate. Related bug: https://bugs.launchpad.net/nova/+bug/1600251
Recommendation: Report a bug upstream if we have the same issue.

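A minimal sketch of one way to serialize this, assuming a per-group lock held across destination selection; the names are illustrative, not the fork's code:

```python
from oslo_concurrency import lockutils

def live_migrate_group_member(instance, group_uuid):
    # Hold a per-group lock across host selection and migration start so
    # two members of the same anti-affinity group cannot race onto the
    # same destination host concurrently.
    with lockutils.lock("server-group-%s" % group_uuid):
        # 1. select a destination host
        # 2. re-validate anti-affinity, including in-progress migrations
        # 3. kick off the live migration
        pass
```
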
Change: Added a [cinder]/session_retries config option to nova to allow configuring the nova-to-cinder client to retry on 500 errors (by default up to 4 times).
Scenario: Edge/Hardening.
Impacts: None.
Upstream Equivalent: The [cinder]/http_retries option might do the same thing now via keystoneauth1.
Recommendation: Ignore.

Change: Added a "max_concurrent_migrations" config option which is used to serialize cold migrations on a single host, similar to the "max_concurrent_live_migrations" option.
Scenario: Edge/Hardening.
Impacts: None.
Upstream Equivalent: None.
Recommendation: Could be upstreamed.

Change: Added a "concurrent_disk_operations" config option to restrict the number of concurrent disk-I/O-intensive operations (image download, image conversion, live snapshot) on a compute host. Defaults to 2.
Scenario: Edge/Hardening.
Impacts: Could unnecessarily block operations and time out the user token (which would not be a problem if configured to use service user tokens, available starting in Pike).
Upstream Equivalent: None.
Recommendation: Could be upstreamed.

Change: Image download is threaded out with eventlet to avoid nova-compute getting stuck on disk I/O.
Scenario: Edge/Hardening.
Impacts: None?
Upstream Equivalent: None.
Recommendation: Could be upstreamed.

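A sketch of the standard eventlet pattern for this: run the blocking disk I/O in eventlet's native thread pool so the compute service's green threads stay responsive. The _fetch_image function is an illustrative stand-in:

```python
from eventlet import tpool

def _fetch_image(image_id, dest_path):
    pass  # blocking read from glance / write to local disk

def download_image(image_id, dest_path):
    # tpool.execute runs the callable in a real OS thread and yields
    # the current green thread until it finishes, so the event loop
    # is not blocked on disk I/O.
    return tpool.execute(_fetch_image, image_id, dest_path)
```
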
Change: Added support for creating thinly provisioned LVM-backed ephemeral root disks.
Scenario: Edge (restricted disk space).
Impacts: Could potentially oversubscribe the disk on a host and run out of space.
Upstream Equivalent: Nova supports sparse logical volumes with the libvirt LVM image backend, but the feature has been deprecated in Rocky: https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.sparse_logical_volumes A blueprint was proposed at one point for thin LVM support (https://review.openstack.org/#/c/442126/) but there was pushback because we had the sparse LV feature.
Recommendation: Could be upstreamed.

Change: Made the [pci]/alias config option mutable so it can be changed and reloaded without restarting the service.
Scenario: Administration/Usability.
Impacts: Would have to audit where the option is used to make sure reloading the config is done properly.
Upstream Equivalent: None.
Recommendation: Could be upstreamed.

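For context, a sketch of what making an option mutable involves with oslo.config, assuming the standard pattern; the real [pci]/alias option is defined in nova's config modules, so this standalone definition is for illustration only:

```python
from oslo_config import cfg

pci_opts = [
    cfg.MultiStrOpt("alias",
                    default=[],
                    mutable=True,  # the fork's change, conceptually
                    help="PCI passthrough alias (JSON-formatted string)."),
]

CONF = cfg.CONF
CONF.register_opts(pci_opts, group="pci")

# On SIGHUP a service calls CONF.mutate_config_files(); any code
# reading CONF.pci.alias afterwards sees the new value, which is why
# every consumer of the option has to be audited for reload safety.
```
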
Change: Added a "rounded_weight" scheduler config option to randomly pick from groups of same-weight hosts, as opposed to "host_subset_size", which randomly picks a host regardless of weight.
Scenario: n/a
Impacts: None.
Upstream Equivalent: None.
Recommendation: Could be upstreamed.

Change: Added a command to purge API DB records related to purged instances in cell DBs.
Scenario: Maintenance.
Impacts: None.
Upstream Equivalent: Partially fixed: https://review.openstack.org/#/c/515034/ - still need to fix: https://bugs.launchpad.net/nova/+bug/1778804
Recommendation: Already done upstream.

Change: Built-in scheduling logging for NoValidHost debugging and causal analysis.
Scenario: Administration/Hardening.
Impacts: None.
Upstream Equivalent: A spec was previously approved upstream but stalled on implementation details: https://specs.openstack.org/openstack/nova-specs/specs/newton/approved/improve-sched-logging.html
Recommendation: Revive and work on this upstream. This is still a major usability issue in nova and is actually worse now with Placement: http://lists.openstack.org/pipermail/openstack-dev/2018-August/132735.html

Change: Whitelists which types of vifs can be attached/detached so that the compute API won't let you try to attach SR-IOV ports.
Scenario: NFV/Usability.
Impacts: Interoperability; likely very biased and hard to maintain upstream given the plethora of vif types in neutron.
Upstream Equivalent: None.
Recommendation: Ignore.

Change: Removed support for deferred IPs, which are used for routed networks. Not sure why; maybe because they group hosts by provider network and then filter hosts by network, which is kind of like routed networks.
Scenario: Scaling.
Impacts: Interoperability.
Upstream Equivalent: https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html
Recommendation: n/a

Change: RequestSpec.numa_topology is reset during move operations since it could be stale after a resize to a different flavor with a different NUMA configuration.
Scenario: Bug fix.
Impacts: None.
Upstream Equivalent: None? https://bugs.launchpad.net/nova/+bug/1763766 might be related.
Recommendation: Report a bug upstream if we have the same issue.

Change: Has scheduler weigher tooling to prefer compute hosts that are "patch current" and "upgrade current" over hosts that are not. Relies on an external patch-tracking service.
Scenario: Maintenance/Hardening.
Impacts: None; this is all internal plumbing and infrastructure tooling used by an external system management tool for Titanium Cloud.
Upstream Equivalent: None; this is heavily deployment-tooling specific.
Recommendation: Unsure; might be a worthwhile exercise for the Upgrade SIG to understand how this works, although the Upgrade SIG is apparently semi-defunct.

Change: A platform.conf exists on each host to indicate if it is "lowlatency". This is used to execute a script which sets CPU wakeup latency on the host during various operations like live migration and live resize, but only for instances with a NUMA requirement. I guess this is used to control/optimize power consumption on the host?
Scenario: Edge?
Impacts: n/a
Upstream Equivalent: None.
Recommendation: n/a

Change: Additional CPU realtime validation.
Scenario: Usability/Hardening.
Impacts: None.
Upstream Equivalent: None.
Recommendation: Could be upstreamed. Getting all of this configuration correct is error-prone, so we should do a better job of validating it.

Change: Post-live-migration failures result in calling the rollback live migration routine.
Scenario: Maintenance/Administration/Hardening.
Impacts: Complex; potentially hard to do properly in all cases depending on what failed.
Upstream Equivalent: None; there was talk back around the ~Newton timeframe of re-architecting live migration to use tasks so we could more easily roll back failed operations, but that never got very far.
Recommendation: Report a bug upstream if we have the same issue, but we would likely need to be specific about what part of post-live-migration failed to determine what needs to be rolled back.

Change: Preserve the UEFI variable store file across reboot, live snapshot and cold migrate.
Scenario: Bug fix.
Impacts: None.
Upstream Equivalent: https://bugs.launchpad.net/nova/+bug/1785123 and https://bugs.launchpad.net/nova/+bug/1633447
Recommendation: Fix the bugs upstream.

Change: Uses ionice -c2 -n4 when formatting disk images and extracting snapshots.
Scenario: Edge/Hardening.
Impacts: None.
Upstream Equivalent: None.
Recommendation: Could be upstreamed.

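A sketch of the ionice pattern: run disk-heavy image operations with the best-effort I/O scheduling class (-c2) at priority 4 (-n4) so they can be deprioritized relative to guest I/O. The qemu-img invocation is a generic example, not the fork's actual code:

```python
import subprocess

def convert_image(src, dst):
    # Wrap the disk-heavy command with ionice so the I/O scheduler can
    # deprioritize it relative to guest workloads on the same host.
    subprocess.check_call([
        "ionice", "-c2", "-n4",
        "qemu-img", "convert", "-O", "raw", src, dst,
    ])
```
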
Change: Added a "sw:wrs:guest:heartbeat" flavor extra spec for setting up a channel between the host and guest for heartbeats. Presumably related to recovery/instance HA support.
Scenario: Edge/Hardening.
Impacts: Security (channels between host and guest).
Upstream Equivalent: Unsure how this is different from the qemu guest agent with the watchdog behavior extra spec. Maybe their version is used by an external tool?
Recommendation: n/a