Fleet and Elastic Agent 8.18.7
Known issues
Failed upgrades leave Elastic Agent stuck until restart
This known issue applies to Elastic Agent 8.18.7 and 9.0.7. Elastic Agent versions 8.19.x and 9.1.x are not affected.
On September 17, 2025, a known issue was discovered that can cause Elastic Agent upgrades to get stuck if an upgrade attempt fails under specific conditions. This happens because the coordinator’s overrideState remains set, leaving the agent in a state that appears to be upgrading.
Conditions
This issue is triggered if the upgrade fails during one of the early checks inside Coordinator.Upgrade, for example:
- The agent is not upgradeable
- Capabilities check denies the upgrade
- When Elastic Agent is tamper-protected, Endpoint must validate that the upgrade action was correctly signed by Kibana to allow the upgrade. If the signature is missing, invalid, or the connection between Elastic Agent and Endpoint was interrupted, the validation fails. This causes the agent coordinator’s override state to become stuck until the agent is restarted.
Symptoms
- Fleet shows the upgrade action as in progress, even though the upgrade is stuck
- No further upgrade attempts succeed
- Elastic Agent status shows an override state indicating upgrade
Workaround
Restart the Elastic Agent to clear the coordinator’s overrideState and allow new upgrade attempts to proceed.
Resolution
This issue was fixed in #9992, which ensures that the coordinator clears its override state whenever an early failure occurs.
The fix is included in versions 9.1.4 and 8.19.4, and planned for versions 9.0.8 and 8.18.8.
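The failure mode and the fix can be illustrated with a small model. This is a minimal sketch, not the agent's actual Go code: the `Coordinator` class, `UpgradeError`, the `"UPGRADING"` value, and the `clear_on_early_failure` switch are all hypothetical names chosen for illustration. The sketch only shows why an override set before the early checks stays in place if the failure path does not reset it.

```python
class UpgradeError(Exception):
    """Raised when an upgrade is rejected by an early check."""


class Coordinator:
    """Toy stand-in for the agent coordinator; not the real Go type."""

    def __init__(self, upgradeable=True):
        self.upgradeable = upgradeable
        self.override_state = None  # None means "report the real state"

    def upgrade(self, version, clear_on_early_failure=True):
        # The override is set before the early checks run.
        self.override_state = "UPGRADING"
        try:
            if not self.upgradeable:
                raise UpgradeError("agent is not upgradeable")
            # Capabilities and signature checks would run here.
        except UpgradeError:
            if clear_on_early_failure:
                # Fixed behavior: an early failure resets the override.
                self.override_state = None
            raise
        return f"upgrade to {version} started"

    def restart(self):
        # A restart starts from a clean state, dropping a stale override.
        self.override_state = None
```

With `clear_on_early_failure=False` the model reproduces the stuck state, and calling `restart()` clears it, which mirrors the workaround. With the default setting the override is cleared as soon as the early check fails.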
fleet-agents template is missing mappings
Details
On May 2, 2025, a known issue was discovered: the .fleet-agents index template was missing a mapping for the local_metadata.complete attribute. This may cause agent check-ins to be rejected and the agents to appear offline.
In the Fleet Server logs, this will appear as:
elastic fail 400: document_parsing_exception: [1:209] object mapping for [local_metadata] tried to parse field [local_metadata] as object, but found a concrete value Eat bulk checkin error; Keep on truckin'
And in the Elastic Agent logs it will appear as:
"log.level":"error","@timestamp":"2025-04-22:12:35:25.295Z","message":"Eat bulk checkin error; Keep on truckin'","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"service.type":"fleet-server","error.message":"elastic fail 400: document_parsing_exception: [1:209] object mapping for [local_metadata] tried to parse field [local_metadata] as object, but found a concrete value","ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"
This attribute was added to the template in versions 8.17.11, 8.18.3, and 8.19.3.
Further investigation revealed that the .fleet-agents index template was not correctly applied due to an unchanged _meta.managed_index_mappings_version number.
This change also affects other attributes, such as upgrade_attempts, namespaces, unprivileged, and unhealthy_reason.
If there is an error related to any of these attributes, there will be a similar error message in the logs.
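When scanning logs for this problem, it can help to pull out the field named in the error. The following is a minimal sketch that assumes the error text has the same wording as the sample messages above. The function name `find_mapping_error_field` and the regular expression are illustrative, and errors for other attributes may be worded differently.

```python
import re

# Matches the wording of the sample error shown in the logs above.
MAPPING_ERROR = re.compile(
    r"document_parsing_exception: \[(?P<line>\d+):(?P<col>\d+)\] "
    r"object mapping for \[(?P<field>[^\]]+)\]"
)


def find_mapping_error_field(log_line):
    """Return the field named in a mapping error, or None if absent."""
    match = MAPPING_ERROR.search(log_line)
    return match.group("field") if match else None
```

Applied to the Fleet Server log line shown above, the function returns `local_metadata`. For a line without a `document_parsing_exception` it returns `None`.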
Impact
Updating to a version with a fixed _meta.managed_index_mappings_version will correctly apply the new index template.
The fixed versions are 8.18.8, 8.19.4, 9.0.8, 9.1.4.
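To check a deployment against the fixed versions listed above, a small helper can compare a version string with the first fixed patch of its release line. This is a sketch under stated assumptions: the name `has_fixed_template` is hypothetical, the input is a plain `major.minor.patch` string, and release lines not listed above are reported as unknown rather than guessed.

```python
# First patch release per (major, minor) line that carries the fix,
# taken from the list of fixed versions above.
FIRST_FIXED_PATCH = {(8, 18): 8, (8, 19): 4, (9, 0): 8, (9, 1): 4}


def has_fixed_template(version):
    """Return True or False for listed release lines, None for others."""
    major, minor, patch = (int(part) for part in version.split("."))
    first_fixed = FIRST_FIXED_PATCH.get((major, minor))
    if first_fixed is None:
        return None  # release line not covered by this note
    return patch >= first_fixed
```

For example, `has_fixed_template("8.18.7")` is `False`, `has_fixed_template("8.18.8")` is `True`, and a version from an unlisted line such as `"8.17.11"` yields `None`.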
New features and enhancements
- Elastic Agent
Bug fixes
- Elastic Agent
- Redact secrets from pre-config, computed-config, components-expected, and components-actual files in diagnostics archive. #9560
- Retry service start command upon failure with 30-second delay. #9313
- Fix reporting of scheduled upgrade details across restarts and cancels. #9562 #8778
- Enable root user to re-enroll unprivileged agent on macOS and Linux. #9603 #8544
- Fix missing liveness healthcheck during container enrollment. #9612 #9611
- Enable admin user to re-enroll unprivileged agent on Windows. #9623 #8544
- Treat exit code 284 from Endpoint binary as non-fatal. #9687
- Ensure failed upgrade actions are removed from queue and details are set. #9634 #9629
- Fleet Server
- Restore connection limiter. #5372
  Restore the connection-level limiter to prevent OOM incidents. This limiter is used in addition to the request-level throttle: once in-flight requests reach max_connections, a 429 is returned, but if the server's total connections exceed max_connections*1.1, the server drops the connection before the TLS handshake.
- Build fleet-server as fully static binary to restore OS matrix compatibility. #5392 #5262
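The two limits described in the connection limiter entry above can be sketched as a single decision function. This is an illustrative model, not Fleet Server's implementation: the function name `admit`, its parameters, and the returned labels are assumptions, and the real server applies the two limits at different layers (connection accept and request handling).

```python
def admit(in_flight_requests, total_connections, max_connections):
    """Return 'drop', 'reject_429', or 'accept' for a new arrival."""
    # Connection-level limit: above 110% of max_connections the
    # connection is dropped before the TLS handshake. Integer math
    # avoids floating-point rounding in the 1.1 multiplier.
    if total_connections * 10 > max_connections * 11:
        return "drop"
    # Request-level throttle: once in-flight requests reach
    # max_connections, the request is answered with HTTP 429.
    if in_flight_requests >= max_connections:
        return "reject_429"
    return "accept"
```

With `max_connections` set to 100, a server holding 100 in-flight requests and 110 connections would still answer with a 429, while the 111th connection would be dropped outright.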