Skip to content

Conversation

@drazisil-codecov
Copy link
Contributor

@drazisil-codecov drazisil-codecov commented Dec 10, 2025

Description

Centralizes retry handling in BaseCodecovTask.retry() and fixes infinite retry bugs in bundle analysis tasks.

Changes:

  • Override BaseCodecovTask.retry() to handle max_retries checks, logging, metrics, and exponential backoff
  • Remove max_retries parameter from LockManager.locked(); delegate retry limits to tasks
  • Remove safe_retry() and _retry() helper methods
  • Add contextual retry logging (repoid, commitid, report_type) nested under context key
  • Fix infinite retry bugs in BundleAnalysisProcessorTask and BundleAnalysisNotifyTask
  • Refactor tests to remove mocks of internal retry() logic and test real behavior

Impact:

  • Prevents infinite retries when max_retries is exceeded
  • Standardizes retry behavior across all tasks
  • Improves observability with structured retry logs
  • Simplifies retry logic by removing redundant helpers

…essor

- Added logic to prevent infinite retries when the maximum retry limit is exceeded.
- Introduced error logging to capture details when max retries are reached.
- Enhanced unit tests to verify behavior for both exceeding and not exceeding max retries.
@sentry
Copy link

sentry bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 95.74468% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.89%. Comparing base (6f76140) to head (8cad4cb).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/worker/tasks/base.py 93.93% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #608      +/-   ##
==========================================
+ Coverage   93.87%   93.89%   +0.01%     
==========================================
  Files        1285     1285              
  Lines       46674    46681       +7     
  Branches     1522     1522              
==========================================
+ Hits        43816    43830      +14     
+ Misses       2548     2541       -7     
  Partials      310      310              
Flag Coverage Δ
workerintegration 58.75% <68.08%> (+0.15%) ⬆️
workerunit 91.26% <95.74%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codecov-notifications
Copy link

codecov-notifications bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 95.74468% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/worker/tasks/base.py 93.93% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

- Updated the LockManager's locked method to eliminate the max_retries parameter, simplifying retry handling.
- Adjusted related unit tests to reflect the removal of max_retries, ensuring proper logging and behavior for retry scenarios.
- Enhanced error handling to ensure retries are managed through the task's retry method instead.
@drazisil-codecov drazisil-codecov changed the title feat(worker): Implement max retries handling for bundle analysis processor feat(worker): Centralize retry logic in BaseCodecovTask and fix infinite retry bugs Dec 10, 2025
- Introduced a sentinel object to differentiate between "not provided" and "explicitly None" for max_retries.
- Updated the retry method to accommodate unlimited retries when max_retries is set to None, aligning with Celery's behavior.
- Simplified logic for determining the effective max_retries value during task retries.
@drazisil-codecov drazisil-codecov marked this pull request as ready for review December 10, 2025 15:23
current_retries = self.request.retries if hasattr(self, "request") else 0
task_max_retries = max_retries if max_retries is not None else self.max_retries

request = getattr(self, "request", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this just return out if self.request is None? The next line is going to freak out trying to call retries on None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed by adding an early check for request is None and handling it gracefully. The code now safely handles the case where request might not exist, using default values instead of potentially accessing attributes on None.

Comment on lines 250 to 253
# Resolve max_retries for checking/logging
# If max_retries is not provided, use self.max_retries from task class
# If max_retries=None is explicitly provided, means unlimited retries (matches Celery behavior)
# self.max_retries is inherited from BaseCodecovTask (TASK_MAX_RETRIES_DEFAULT) or overridden by task
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we eliminate some of the comments in this file, it's getting hard to read the code, either aggregate the important chunks at the top of the function definition or just remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Reduced excessive inline comments and consolidated the important information. The code is now much more readable while still maintaining clarity on the key logic.

log = logging.getLogger("worker")

# Sentinel object to distinguish "not provided" from "explicitly None" for max_retries
_NOT_PROVIDED = object()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little bit weird to make a comparison to object(). I'd prefer is this were a singleton class or some other primitive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! Replaced object() with a singleton class _NotProvided which is clearer and more maintainable. This makes the sentinel pattern more explicit and easier to understand.

# If max_retries is not provided, use self.max_retries from task class
# If max_retries=None is explicitly provided, means unlimited retries (matches Celery behavior)
# self.max_retries is inherited from BaseCodecovTask (TASK_MAX_RETRIES_DEFAULT) or overridden by task
if max_retries is _NOT_PROVIDED:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the whole point of this is to override a subclasses max_retries, seems like we should just rely on the subclass having the correct max set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! The ability to override max_retries is actually needed in practice. For example, upload_processor.py uses error-specific retry limits (self.retry(max_retries=error.max_retries, ...)) where different errors may have different retry limits than the task default. This provides flexibility for dynamic retry behavior based on error conditions while still defaulting to the task's max_retries when not specified.

Comment on lines 261 to 263
request_kwargs = {}
if request and hasattr(request, "kwargs"):
request_kwargs = request.kwargs if request.kwargs is not None else {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to not just be request.kwargs? seems like we don't need to copy the data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Simplified kwargs handling - we now use request.kwargs directly when available instead of creating unnecessary copies. The code is cleaner and more efficient.

if request and hasattr(request, "kwargs"):
request_kwargs = request.kwargs if request.kwargs is not None else {}
# Note: kwargs parameter is Celery's special parameter for task kwargs on retry
retry_kwargs = kwargs if kwargs is not None else {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just default kwargs in the function def to {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! However, using a mutable default (kwargs={}) is a Python anti-pattern. I kept kwargs=None as the default and handle it in the body, which is safer and follows Python best practices. The code now properly handles both None and empty dict cases.

Comment on lines 313 to 320
# Pass max_retries to Celery:
# - If not provided, use self.max_retries (inherited from BaseCodecovTask or overridden)
# - If explicitly provided (including None), pass as-is (None = unlimited in Celery)
if max_retries is _NOT_PROVIDED:
# Not provided, use task's max_retries (inherited from BaseCodecovTask or overridden)
celery_max_retries = self.max_retries
else:
# Explicitly provided, pass as-is (None means unlimited in Celery, matches superclass behavior)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, this is hard to read with so many comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Removed excessive comments and simplified the code. The important information is now in the docstring, and the implementation is cleaner and easier to read.

f"Task {self.name} exceeded max retries",
extra={
"task_name": self.name,
"context": context if context else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be its own var instead of just explicitly putting the variables that we pull from kwargs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, this is the UploadContext. It's being added to the kwargs with this PR so the base class retry logic has access to it for logging. These logs match the structure of the other logs.

Comment on lines 286 to 289
if UploadFlow.has_begun():
UploadFlow.log(UploadFlow.UNCAUGHT_RETRY_EXCEPTION)
if TestResultsFlow.has_begun():
TestResultsFlow.log(TestResultsFlow.UNCAUGHT_RETRY_EXCEPTION)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too specific for a base class to have

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point! Removed UploadFlow and TestResultsFlow logging from the base class as they are too specific. These flows can handle their own logging in their specific task implementations if needed. The base class should remain generic.

- Changed the logging of context fields to use a nested dictionary for improved clarity.
- Updated error and warning logs to include the new context structure, enhancing the detail of logged information during task retries and max retries exceeded scenarios.
Comment on lines +503 to 513

def test_bundle_analysis_process_upload_rate_limit_error(
caplog,
dbsession,
mocker,
mock_storage,
):
"""Test that rate_limit_error (retryable) causes retry with correct countdown"""
storage_path = (
"v1/repos/testing/ed1bdd67-8fd2-4cdb-ac9e-39b99e4a3892/bundle_report.sqlite"
)

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! This bug was already fixed in the implementation. The code initializes result: ProcessingResult | None = None before the try block (line 161 in apps/worker/tasks/bundle_analysis_processor.py) and checks if result is not None: in the finally block (line 218) before accessing result.bundle_report. This ensures that if an exception occurs before result is assigned, the finally block safely skips cleanup without raising a NameError.

"""Sentinel class to distinguish 'not provided' from 'explicitly None' for max_retries"""


_NOT_PROVIDED = _NotProvided()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasrockhu-codecov I don't understand why a named class is not the same as a unnamed class. This is different from how I do singletons in JavaScript. Also, while the comment says "singleton", is that really what we are doing here? The idea is to check for "set vs not-set", because Celery treats None differently from "not set".

Copy link
Contributor Author

@drazisil-codecov drazisil-codecov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall Assessment

This is a solid PR that centralizes retry logic and fixes important infinite retry bugs. The approach is clean and well-tested. Here are some suggestions for improvement:

Strengths

✅ Centralizes retry logic in one place
✅ Fixes infinite retry bugs
✅ Good test coverage (95.35%)
✅ Structured logging with context
✅ Removes redundant helper methods

Suggestions for Improvement

1. Missing Test Coverage (2 lines in base.py)

The coverage report indicates 2 lines are missing coverage. Based on the code, these are likely:

  • Lines 254-258: The if request is None: edge case branch
  • Line 262: The hasattr(request, "kwargs") check when request.kwargs is falsy

Recommendation: Add tests for these edge cases to reach 100% coverage.

2. Redundant Check in request.kwargs Handling

request_kwargs = (
    request.kwargs if hasattr(request, "kwargs") and request.kwargs else {}
)

The and request.kwargs check is redundant since hasattr only checks existence. If request.kwargs exists but is None, this would still use {}, which is correct, but the check could be simplified.

Recommendation: Consider simplifying to:

request_kwargs = getattr(request, "kwargs", None) or {}

3. Inconsistent Retry Check in Bundle Analysis

In bundle_analysis_processor.py line 182, there's still a manual self.request.retries < self.max_retries check before calling self.retry(). This is now redundant since self.retry() handles max_retries internally.

Recommendation: Remove the manual check to be consistent with the centralized approach:

if result.error and result.error.is_retryable:
    log.warn(...)
    self.retry(countdown=30 * (2**self.request.retries))

4. Code Duplication in retry() Method

The super().retry() calls at lines 316-330 are nearly identical, differing only by the kwargs parameter.

Recommendation: Simplify to:

retry_kwargs_dict = {"countdown": countdown, "exc": exc, "max_retries": celery_max_retries}
if kwargs:
    retry_kwargs_dict["kwargs"] = kwargs
retry_kwargs_dict.update(other_kwargs)
return super().retry(**retry_kwargs_dict)

5. Documentation Enhancement

The docstring for retry() could be clearer about the sentinel pattern usage.

Recommendation: Add a note explaining why _NOT_PROVIDED is used:

"""
Override Celery's retry() to automatically check max_retries and track metrics.

The max_retries parameter uses a sentinel pattern (_NOT_PROVIDED) to distinguish:
- Not provided: Uses self.max_retries from the task class
- Explicitly None: Unlimited retries (matches Celery behavior)
- Explicit integer: Override the task's max_retries for this retry
"""

6. Potential Edge Case: Empty String Values

The context extraction checks is not None, but doesn't handle empty strings. If commitid="", it would still be included in context.

Recommendation: Consider filtering out empty strings:

if all_kwargs.get("commitid") not in (None, ""):
    context["commitid"] = all_kwargs.get("commitid")

7. Lock Manager Documentation

The docstring for LockManager.locked() mentions that tasks should handle max_retries via self.retry(), but it might be helpful to add an example.

Recommendation: Add a brief example in the docstring showing the pattern:

"""
Raises:
    LockRetry: If lock cannot be acquired, with countdown for retry scheduling.
              The calling task should handle max_retries checking via self.retry().
              
Example:
    try:
        with lock_manager.locked(LockType.UPLOAD, retry_num=self.request.retries):
            # do work
    except LockRetry as e:
        self.retry(countdown=e.countdown)
"""

Minor Issues

  • Line 184 in bundle_analysis_processor.py: log.warn should be log.warning (though warn is deprecated but still works)

Questions

  1. Should the request is None case ever happen in production? If not, should it raise an exception instead of silently using defaults?
  2. Is there a reason to keep the manual max_retries check in bundle_analysis_processor.py, or should it be removed for consistency?

Overall, this is a well-executed refactoring that improves code maintainability and fixes critical bugs. The suggestions above are mostly minor improvements and edge case handling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants