-
Notifications
You must be signed in to change notification settings - Fork 10
feat(worker): Centralize retry logic in BaseCodecovTask and fix infinite retry bugs #608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…essor - Added logic to prevent infinite retries when the maximum retry limit is exceeded. - Introduced error logging to capture details when max retries are reached. - Enhanced unit tests to verify behavior for both exceeding and not exceeding max retries.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #608 +/- ##
==========================================
+ Coverage 93.87% 93.89% +0.01%
==========================================
Files 1285 1285
Lines 46674 46681 +7
Branches 1522 1522
==========================================
+ Hits 43816 43830 +14
+ Misses 2548 2541 -7
Partials 310 310
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
- Updated the LockManager's locked method to eliminate the max_retries parameter, simplifying retry handling. - Adjusted related unit tests to reflect the removal of max_retries, ensuring proper logging and behavior for retry scenarios. - Enhanced error handling to ensure retries are managed through the task's retry method instead.
- Introduced a sentinel object to differentiate between "not provided" and "explicitly None" for max_retries. - Updated the retry method to accommodate unlimited retries when max_retries is set to None, aligning with Celery's behavior. - Simplified logic for determining the effective max_retries value during task retries.
| current_retries = self.request.retries if hasattr(self, "request") else 0 | ||
| task_max_retries = max_retries if max_retries is not None else self.max_retries | ||
|
|
||
| request = getattr(self, "request", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this just return out if self.request is None? The next line is going to freak out trying to call retries on None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Fixed by adding an early check for request is None and handling it gracefully. The code now safely handles the case where request might not exist, using default values instead of potentially accessing attributes on None.
apps/worker/tasks/base.py
Outdated
| # Resolve max_retries for checking/logging | ||
| # If max_retries is not provided, use self.max_retries from task class | ||
| # If max_retries=None is explicitly provided, means unlimited retries (matches Celery behavior) | ||
| # self.max_retries is inherited from BaseCodecovTask (TASK_MAX_RETRIES_DEFAULT) or overridden by task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we eliminate some of the comments in this file, it's getting hard to read the code, either aggregate the important chunks at the top of the function definition or just remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Reduced excessive inline comments and consolidated the important information. The code is now much more readable while still maintaining clarity on the key logic.
apps/worker/tasks/base.py
Outdated
| log = logging.getLogger("worker") | ||
|
|
||
| # Sentinel object to distinguish "not provided" from "explicitly None" for max_retries | ||
| _NOT_PROVIDED = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a little bit weird to make a comparison to object(). I'd prefer is this were a singleton class or some other primitive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion! Replaced object() with a singleton class _NotProvided which is clearer and more maintainable. This makes the sentinel pattern more explicit and easier to understand.
| # If max_retries is not provided, use self.max_retries from task class | ||
| # If max_retries=None is explicitly provided, means unlimited retries (matches Celery behavior) | ||
| # self.max_retries is inherited from BaseCodecovTask (TASK_MAX_RETRIES_DEFAULT) or overridden by task | ||
| if max_retries is _NOT_PROVIDED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the whole point of this is to override a subclasses max_retries, seems like we should just rely on the subclass having the correct max set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! The ability to override max_retries is actually needed in practice. For example, upload_processor.py uses error-specific retry limits (self.retry(max_retries=error.max_retries, ...)) where different errors may have different retry limits than the task default. This provides flexibility for dynamic retry behavior based on error conditions while still defaulting to the task's max_retries when not specified.
apps/worker/tasks/base.py
Outdated
| request_kwargs = {} | ||
| if request and hasattr(request, "kwargs"): | ||
| request_kwargs = request.kwargs if request.kwargs is not None else {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to not just be request.kwargs? seems like we don't need to copy the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Simplified kwargs handling - we now use request.kwargs directly when available instead of creating unnecessary copies. The code is cleaner and more efficient.
| if request and hasattr(request, "kwargs"): | ||
| request_kwargs = request.kwargs if request.kwargs is not None else {} | ||
| # Note: kwargs parameter is Celery's special parameter for task kwargs on retry | ||
| retry_kwargs = kwargs if kwargs is not None else {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just default kwargs in the function def to {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! However, using a mutable default (kwargs={}) is a Python anti-pattern. I kept kwargs=None as the default and handle it in the body, which is safer and follows Python best practices. The code now properly handles both None and empty dict cases.
apps/worker/tasks/base.py
Outdated
| # Pass max_retries to Celery: | ||
| # - If not provided, use self.max_retries (inherited from BaseCodecovTask or overridden) | ||
| # - If explicitly provided (including None), pass as-is (None = unlimited in Celery) | ||
| if max_retries is _NOT_PROVIDED: | ||
| # Not provided, use task's max_retries (inherited from BaseCodecovTask or overridden) | ||
| celery_max_retries = self.max_retries | ||
| else: | ||
| # Explicitly provided, pass as-is (None means unlimited in Celery, matches superclass behavior) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, this is hard to read with so many comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! Removed excessive comments and simplified the code. The important information is now in the docstring, and the implementation is cleaner and easier to read.
| f"Task {self.name} exceeded max retries", | ||
| extra={ | ||
| "task_name": self.name, | ||
| "context": context if context else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be its own var instead of just explicitly putting the variables that we pull from kwargs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, this is the UploadContext. It's being added to the kwargs with this PR so the base class retry logic has access to it for logging. These logs match the structure of the other logs.
apps/worker/tasks/base.py
Outdated
| if UploadFlow.has_begun(): | ||
| UploadFlow.log(UploadFlow.UNCAUGHT_RETRY_EXCEPTION) | ||
| if TestResultsFlow.has_begun(): | ||
| TestResultsFlow.log(TestResultsFlow.UNCAUGHT_RETRY_EXCEPTION) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too specific for a base class to have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent point! Removed UploadFlow and TestResultsFlow logging from the base class as they are too specific. These flows can handle their own logging in their specific task implementations if needed. The base class should remain generic.
- Changed the logging of context fields to use a nested dictionary for improved clarity. - Updated error and warning logs to include the new context structure, enhancing the detail of logged information during task retries and max retries exceeded scenarios.
|
|
||
| def test_bundle_analysis_process_upload_rate_limit_error( | ||
| caplog, | ||
| dbsession, | ||
| mocker, | ||
| mock_storage, | ||
| ): | ||
| """Test that rate_limit_error (retryable) causes retry with correct countdown""" | ||
| storage_path = ( | ||
| "v1/repos/testing/ed1bdd67-8fd2-4cdb-ac9e-39b99e4a3892/bundle_report.sqlite" | ||
| ) |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! This bug was already fixed in the implementation. The code initializes result: ProcessingResult | None = None before the try block (line 161 in apps/worker/tasks/bundle_analysis_processor.py) and checks if result is not None: in the finally block (line 218) before accessing result.bundle_report. This ensures that if an exception occurs before result is assigned, the finally block safely skips cleanup without raising a NameError.
| """Sentinel class to distinguish 'not provided' from 'explicitly None' for max_retries""" | ||
|
|
||
|
|
||
| _NOT_PROVIDED = _NotProvided() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasrockhu-codecov I don't understand why a named class is not the same as a unnamed class. This is different from how I do singletons in JavaScript. Also, while the comment says "singleton", is that really what we are doing here? The idea is to check for "set vs not-set", because Celery treats None differently from "not set".
drazisil-codecov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall Assessment
This is a solid PR that centralizes retry logic and fixes important infinite retry bugs. The approach is clean and well-tested. Here are some suggestions for improvement:
Strengths
✅ Centralizes retry logic in one place
✅ Fixes infinite retry bugs
✅ Good test coverage (95.35%)
✅ Structured logging with context
✅ Removes redundant helper methods
Suggestions for Improvement
1. Missing Test Coverage (2 lines in base.py)
The coverage report indicates 2 lines are missing coverage. Based on the code, these are likely:
- Lines 254-258: The
if request is None:edge case branch - Line 262: The
hasattr(request, "kwargs")check whenrequest.kwargsis falsy
Recommendation: Add tests for these edge cases to reach 100% coverage.
2. Redundant Check in request.kwargs Handling
request_kwargs = (
request.kwargs if hasattr(request, "kwargs") and request.kwargs else {}
)The and request.kwargs check is redundant since hasattr only checks existence. If request.kwargs exists but is None, this would still use {}, which is correct, but the check could be simplified.
Recommendation: Consider simplifying to:
request_kwargs = getattr(request, "kwargs", None) or {}3. Inconsistent Retry Check in Bundle Analysis
In bundle_analysis_processor.py line 182, there's still a manual self.request.retries < self.max_retries check before calling self.retry(). This is now redundant since self.retry() handles max_retries internally.
Recommendation: Remove the manual check to be consistent with the centralized approach:
if result.error and result.error.is_retryable:
log.warn(...)
self.retry(countdown=30 * (2**self.request.retries))4. Code Duplication in retry() Method
The super().retry() calls at lines 316-330 are nearly identical, differing only by the kwargs parameter.
Recommendation: Simplify to:
retry_kwargs_dict = {"countdown": countdown, "exc": exc, "max_retries": celery_max_retries}
if kwargs:
retry_kwargs_dict["kwargs"] = kwargs
retry_kwargs_dict.update(other_kwargs)
return super().retry(**retry_kwargs_dict)5. Documentation Enhancement
The docstring for retry() could be clearer about the sentinel pattern usage.
Recommendation: Add a note explaining why _NOT_PROVIDED is used:
"""
Override Celery's retry() to automatically check max_retries and track metrics.
The max_retries parameter uses a sentinel pattern (_NOT_PROVIDED) to distinguish:
- Not provided: Uses self.max_retries from the task class
- Explicitly None: Unlimited retries (matches Celery behavior)
- Explicit integer: Override the task's max_retries for this retry
"""6. Potential Edge Case: Empty String Values
The context extraction checks is not None, but doesn't handle empty strings. If commitid="", it would still be included in context.
Recommendation: Consider filtering out empty strings:
if all_kwargs.get("commitid") not in (None, ""):
context["commitid"] = all_kwargs.get("commitid")7. Lock Manager Documentation
The docstring for LockManager.locked() mentions that tasks should handle max_retries via self.retry(), but it might be helpful to add an example.
Recommendation: Add a brief example in the docstring showing the pattern:
"""
Raises:
LockRetry: If lock cannot be acquired, with countdown for retry scheduling.
The calling task should handle max_retries checking via self.retry().
Example:
try:
with lock_manager.locked(LockType.UPLOAD, retry_num=self.request.retries):
# do work
except LockRetry as e:
self.retry(countdown=e.countdown)
"""Minor Issues
- Line 184 in
bundle_analysis_processor.py:log.warnshould belog.warning(thoughwarnis deprecated but still works)
Questions
- Should the
request is Nonecase ever happen in production? If not, should it raise an exception instead of silently using defaults? - Is there a reason to keep the manual
max_retriescheck in bundle_analysis_processor.py, or should it be removed for consistency?
Overall, this is a well-executed refactoring that improves code maintainability and fixes critical bugs. The suggestions above are mostly minor improvements and edge case handling.
…ve logging in BundleAnalysisProcessorTask
Description
Centralizes retry handling in
BaseCodecovTask.retry()and fixes infinite retry bugs in bundle analysis tasks.Changes:
BaseCodecovTask.retry()to handlemax_retrieschecks, logging, metrics, and exponential backoffmax_retriesparameter fromLockManager.locked(); delegate retry limits to taskssafe_retry()and_retry()helper methodsrepoid,commitid,report_type) nested undercontextkeyBundleAnalysisProcessorTaskandBundleAnalysisNotifyTaskretry()logic and test real behaviorImpact:
max_retriesis exceeded