feat(worker): Centralize retry logic in BaseCodecovTask and fix infinite retry bugs #608

drazisil-codecov · 2025-12-10T12:38:43Z

Description

Centralizes retry handling in BaseCodecovTask.retry() and fixes infinite retry bugs in bundle analysis tasks.

Changes:

Override BaseCodecovTask.retry() to handle max_retries checks, logging, metrics, and exponential backoff
Remove max_retries parameter from LockManager.locked(); delegate retry limits to tasks
Remove safe_retry() and _retry() helper methods
Add contextual retry logging (repoid, commitid, report_type) nested under context key
Fix infinite retry bugs in BundleAnalysisProcessorTask and BundleAnalysisNotifyTask
Refactor tests to remove mocks of internal retry() logic and test real behavior

Impact:

Prevents infinite retries when max_retries is exceeded
Standardizes retry behavior across all tasks
Improves observability with structured retry logs
Simplifies retry logic by removing redundant helpers

…essor - Added logic to prevent infinite retries when the maximum retry limit is exceeded. - Introduced error logging to capture details when max retries are reached. - Enhanced unit tests to verify behavior for both exceeding and not exceeding max retries.

sentry · 2025-12-10T12:46:20Z

Codecov Report

❌ Patch coverage is 95.74468% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.89%. Comparing base (6f76140) to head (8cad4cb).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
apps/worker/tasks/base.py	93.93%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #608      +/-   ##
==========================================
+ Coverage   93.87%   93.89%   +0.01%     
==========================================
  Files        1285     1285              
  Lines       46674    46681       +7     
  Branches     1522     1522              
==========================================
+ Hits        43816    43830      +14     
+ Misses       2548     2541       -7     
  Partials      310      310

Flag	Coverage Δ
workerintegration	`58.75% <68.08%> (+0.15%)`	⬆️
workerunit	`91.26% <95.74%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

codecov-notifications · 2025-12-10T12:46:23Z

Codecov Report

❌ Patch coverage is 95.74468% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
apps/worker/tasks/base.py	93.93%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

- Updated the LockManager's locked method to eliminate the max_retries parameter, simplifying retry handling. - Adjusted related unit tests to reflect the removal of max_retries, ensuring proper logging and behavior for retry scenarios. - Enhanced error handling to ensure retries are managed through the task's retry method instead.

…ockManager tests

…fy tests

- Introduced a sentinel object to differentiate between "not provided" and "explicitly None" for max_retries. - Updated the retry method to accommodate unlimited retries when max_retries is set to None, aligning with Celery's behavior. - Simplified logic for determining the effective max_retries value during task retries.

apps/worker/services/lock_manager.py

thomasrockhu-codecov · 2025-12-10T20:47:18Z

apps/worker/tasks/base.py

-        current_retries = self.request.retries if hasattr(self, "request") else 0
-        task_max_retries = max_retries if max_retries is not None else self.max_retries
-
+        request = getattr(self, "request", None)


can this just return out if self.request is None? The next line is going to freak out trying to call retries on None

Good catch! Fixed by adding an early check for request is None and handling it gracefully. The code now safely handles the case where request might not exist, using default values instead of potentially accessing attributes on None.

thomasrockhu-codecov · 2025-12-10T20:47:51Z

apps/worker/tasks/base.py

+        # Resolve max_retries for checking/logging
+        # If max_retries is not provided, use self.max_retries from task class
+        # If max_retries=None is explicitly provided, means unlimited retries (matches Celery behavior)
+        # self.max_retries is inherited from BaseCodecovTask (TASK_MAX_RETRIES_DEFAULT) or overridden by task


can we eliminate some of the comments in this file, it's getting hard to read the code, either aggregate the important chunks at the top of the function definition or just remove

Good point! Reduced excessive inline comments and consolidated the important information. The code is now much more readable while still maintaining clarity on the key logic.

thomasrockhu-codecov · 2025-12-10T20:49:04Z

apps/worker/tasks/base.py

 log = logging.getLogger("worker")

+# Sentinel object to distinguish "not provided" from "explicitly None" for max_retries
+_NOT_PROVIDED = object()


this is a little bit weird to make a comparison to object(). I'd prefer is this were a singleton class or some other primitive

Good suggestion! Replaced object() with a singleton class _NotProvided which is clearer and more maintainable. This makes the sentinel pattern more explicit and easier to understand.

thomasrockhu-codecov · 2025-12-10T20:50:01Z

apps/worker/tasks/base.py

+        # If max_retries is not provided, use self.max_retries from task class
+        # If max_retries=None is explicitly provided, means unlimited retries (matches Celery behavior)
+        # self.max_retries is inherited from BaseCodecovTask (TASK_MAX_RETRIES_DEFAULT) or overridden by task
+        if max_retries is _NOT_PROVIDED:


I feel like the whole point of this is to override a subclasses max_retries, seems like we should just rely on the subclass having the correct max set?

Good question! The ability to override max_retries is actually needed in practice. For example, upload_processor.py uses error-specific retry limits (self.retry(max_retries=error.max_retries, ...)) where different errors may have different retry limits than the task default. This provides flexibility for dynamic retry behavior based on error conditions while still defaulting to the task's max_retries when not specified.

thomasrockhu-codecov · 2025-12-10T20:50:26Z

apps/worker/tasks/base.py

+        request_kwargs = {}
+        if request and hasattr(request, "kwargs"):
+            request_kwargs = request.kwargs if request.kwargs is not None else {}


does this need to not just be request.kwargs? seems like we don't need to copy the data

Good catch! Simplified kwargs handling - we now use request.kwargs directly when available instead of creating unnecessary copies. The code is cleaner and more efficient.

thomasrockhu-codecov · 2025-12-10T20:51:35Z

apps/worker/tasks/base.py

+        if request and hasattr(request, "kwargs"):
+            request_kwargs = request.kwargs if request.kwargs is not None else {}
+        # Note: kwargs parameter is Celery's special parameter for task kwargs on retry
+        retry_kwargs = kwargs if kwargs is not None else {}


can we just default kwargs in the function def to {}

Thanks for the suggestion! However, using a mutable default (kwargs={}) is a Python anti-pattern. I kept kwargs=None as the default and handle it in the body, which is safer and follows Python best practices. The code now properly handles both None and empty dict cases.

thomasrockhu-codecov · 2025-12-10T20:53:56Z

apps/worker/tasks/base.py

+        # Pass max_retries to Celery:
+        # - If not provided, use self.max_retries (inherited from BaseCodecovTask or overridden)
+        # - If explicitly provided (including None), pass as-is (None = unlimited in Celery)
+        if max_retries is _NOT_PROVIDED:
+            # Not provided, use task's max_retries (inherited from BaseCodecovTask or overridden)
+            celery_max_retries = self.max_retries
+        else:
+            # Explicitly provided, pass as-is (None means unlimited in Celery, matches superclass behavior)


same, this is hard to read with so many comments

Agreed! Removed excessive comments and simplified the code. The important information is now in the docstring, and the implementation is cleaner and easier to read.

thomasrockhu-codecov · 2025-12-10T20:54:47Z

apps/worker/tasks/base.py

                f"Task {self.name} exceeded max retries",
                extra={
-                    "task_name": self.name,
+                    "context": context if context else None,


does this need to be its own var instead of just explicitly putting the variables that we pull from kwargs

To clarify, this is the UploadContext. It's being added to the kwargs with this PR so the base class retry logic has access to it for logging. These logs match the structure of the other logs.

thomasrockhu-codecov · 2025-12-10T20:55:17Z

apps/worker/tasks/base.py

+            if UploadFlow.has_begun():
+                UploadFlow.log(UploadFlow.UNCAUGHT_RETRY_EXCEPTION)
+            if TestResultsFlow.has_begun():
+                TestResultsFlow.log(TestResultsFlow.UNCAUGHT_RETRY_EXCEPTION)


this is too specific for a base class to have

Excellent point! Removed UploadFlow and TestResultsFlow logging from the base class as they are too specific. These flows can handle their own logging in their specific task implementations if needed. The base class should remain generic.

…Task

- Changed the logging of context fields to use a nested dictionary for improved clarity. - Updated error and warning logs to include the new context structure, enhancing the detail of logged information during task retries and max retries exceeded scenarios.

drazisil-codecov · 2025-12-11T12:45:05Z

apps/worker/tasks/tests/unit/test_bundle_analysis_processor_task.py

+
+def test_bundle_analysis_process_upload_rate_limit_error(
+    caplog,
+    dbsession,
+    mocker,
+    mock_storage,
 ):
+    """Test that rate_limit_error (retryable) causes retry with correct countdown"""
    storage_path = (
        "v1/repos/testing/ed1bdd67-8fd2-4cdb-ac9e-39b99e4a3892/bundle_report.sqlite"
    )


Good catch! This bug was already fixed in the implementation. The code initializes result: ProcessingResult | None = None before the try block (line 161 in apps/worker/tasks/bundle_analysis_processor.py) and checks if result is not None: in the finally block (line 218) before accessing result.bundle_report. This ensures that if an exception occurs before result is assigned, the finally block safely skips cleanup without raising a NameError.

drazisil-codecov · 2025-12-11T16:21:11Z

apps/worker/tasks/base.py

+    """Sentinel class to distinguish 'not provided' from 'explicitly None' for max_retries"""
+
+
+_NOT_PROVIDED = _NotProvided()


@thomasrockhu-codecov I don't understand why a named class is not the same as a unnamed class. This is different from how I do singletons in JavaScript. Also, while the comment says "singleton", is that really what we are doing here? The idea is to check for "set vs not-set", because Celery treats None differently from "not set".

drazisil-codecov

Overall Assessment

This is a solid PR that centralizes retry logic and fixes important infinite retry bugs. The approach is clean and well-tested. Here are some suggestions for improvement:

Strengths

✅ Centralizes retry logic in one place
✅ Fixes infinite retry bugs
✅ Good test coverage (95.35%)
✅ Structured logging with context
✅ Removes redundant helper methods

Suggestions for Improvement

1. Missing Test Coverage (2 lines in `base.py`)

The coverage report indicates 2 lines are missing coverage. Based on the code, these are likely:

Lines 254-258: The if request is None: edge case branch
Line 262: The hasattr(request, "kwargs") check when request.kwargs is falsy

Recommendation: Add tests for these edge cases to reach 100% coverage.

2. Redundant Check in `request.kwargs` Handling

request_kwargs = (
    request.kwargs if hasattr(request, "kwargs") and request.kwargs else {}
)

The and request.kwargs check is redundant since hasattr only checks existence. If request.kwargs exists but is None, this would still use {}, which is correct, but the check could be simplified.

Recommendation: Consider simplifying to:

request_kwargs = getattr(request, "kwargs", None) or {}

3. Inconsistent Retry Check in Bundle Analysis

In bundle_analysis_processor.py line 182, there's still a manual self.request.retries < self.max_retries check before calling self.retry(). This is now redundant since self.retry() handles max_retries internally.

Recommendation: Remove the manual check to be consistent with the centralized approach:

if result.error and result.error.is_retryable:
    log.warn(...)
    self.retry(countdown=30 * (2**self.request.retries))

4. Code Duplication in `retry()` Method

The super().retry() calls at lines 316-330 are nearly identical, differing only by the kwargs parameter.

Recommendation: Simplify to:

retry_kwargs_dict = {"countdown": countdown, "exc": exc, "max_retries": celery_max_retries}
if kwargs:
    retry_kwargs_dict["kwargs"] = kwargs
retry_kwargs_dict.update(other_kwargs)
return super().retry(**retry_kwargs_dict)

5. Documentation Enhancement

The docstring for retry() could be clearer about the sentinel pattern usage.

Recommendation: Add a note explaining why _NOT_PROVIDED is used:

"""
Override Celery's retry() to automatically check max_retries and track metrics.

The max_retries parameter uses a sentinel pattern (_NOT_PROVIDED) to distinguish:
- Not provided: Uses self.max_retries from the task class
- Explicitly None: Unlimited retries (matches Celery behavior)
- Explicit integer: Override the task's max_retries for this retry
"""

6. Potential Edge Case: Empty String Values

The context extraction checks is not None, but doesn't handle empty strings. If commitid="", it would still be included in context.

Recommendation: Consider filtering out empty strings:

if all_kwargs.get("commitid") not in (None, ""):
    context["commitid"] = all_kwargs.get("commitid")

7. Lock Manager Documentation

The docstring for LockManager.locked() mentions that tasks should handle max_retries via self.retry(), but it might be helpful to add an example.

Recommendation: Add a brief example in the docstring showing the pattern:

"""
Raises:
    LockRetry: If lock cannot be acquired, with countdown for retry scheduling.
              The calling task should handle max_retries checking via self.retry().
              
Example:
    try:
        with lock_manager.locked(LockType.UPLOAD, retry_num=self.request.retries):
            # do work
    except LockRetry as e:
        self.retry(countdown=e.countdown)
"""

Minor Issues

Line 184 in bundle_analysis_processor.py: log.warn should be log.warning (though warn is deprecated but still works)

Questions

Should the request is None case ever happen in production? If not, should it raise an exception instead of silently using defaults?
Is there a reason to keep the manual max_retries check in bundle_analysis_processor.py, or should it be removed for consistency?

Overall, this is a well-executed refactoring that improves code maintainability and fixes critical bugs. The suggestions above are mostly minor improvements and edge case handling.

…ve logging in BundleAnalysisProcessorTask

drazisil-codecov changed the title ~~feat(worker): Implement max retries handling for bundle analysis processor~~ feat(worker): Centralize retry logic in BaseCodecovTask and fix infinite retry bugs Dec 10, 2025

drazisil-codecov added 5 commits December 10, 2025 09:28

refactor(tests): Remove redundant comments regarding max_retries in L…

0a3935c

…ockManager tests

refactor(tests): Remove max_retries tests from LockManager unit tests

40245d1

refactor(tests): Simplify task initialization in bundle analysis noti…

b12f436

…fy tests

refactor(worker): Simplify request handling in BaseCodecovTask

eec9ac4

drazisil-codecov marked this pull request as ready for review December 10, 2025 15:23

thomasrockhu-codecov requested changes Dec 10, 2025

View reviewed changes

drazisil-codecov requested a review from thomasrockhu-codecov December 11, 2025 12:31

drazisil-codecov added 2 commits December 11, 2025 07:32

refactor(worker): Improve logging and request handling in BaseCodecov…

77eccb6

…Task

sentry bot reviewed Dec 11, 2025

View reviewed changes

refactor(worker): Improve result handling in BundleAnalysisProcessorTask

f7a1889

drazisil-codecov commented Dec 11, 2025

View reviewed changes

refactor(worker): Streamline retry logic in BaseCodecovTask and impro…

8cad4cb

…ve logging in BundleAnalysisProcessorTask

		"""Sentinel class to distinguish 'not provided' from 'explicitly None' for max_retries"""


		_NOT_PROVIDED = _NotProvided()

feat(worker): Centralize retry logic in BaseCodecovTask and fix infinite retry bugs #608

Are you sure you want to change the base?

feat(worker): Centralize retry logic in BaseCodecovTask and fix infinite retry bugs #608

Uh oh!

Conversation

drazisil-codecov commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

sentry bot commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

codecov-notifications bot commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drazisil-codecov left a comment

Choose a reason for hiding this comment

Overall Assessment

Strengths

Suggestions for Improvement

1. Missing Test Coverage (2 lines in base.py)

2. Redundant Check in request.kwargs Handling

3. Inconsistent Retry Check in Bundle Analysis

4. Code Duplication in retry() Method

5. Documentation Enhancement

6. Potential Edge Case: Empty String Values

7. Lock Manager Documentation

Minor Issues

Questions

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

drazisil-codecov commented Dec 10, 2025 •

edited

Loading

sentry bot commented Dec 10, 2025 •

edited

Loading

codecov-notifications bot commented Dec 10, 2025 •

edited

Loading

1. Missing Test Coverage (2 lines in `base.py`)

2. Redundant Check in `request.kwargs` Handling

4. Code Duplication in `retry()` Method