diff --git a/README.md b/README.md index 0aae76df..565e14bb 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ $ cd sourcecode $ python main.py ``` -Most versions of Python3 should work, but we have tested the code with Python 3.8. +Multiple versions of Python3 should work, but we have tested the code with Python 3.10. ### Community Notes data diff --git a/documentation/about/faq.md b/documentation/about/faq.md index 03a01155..f934ea36 100644 --- a/documentation/about/faq.md +++ b/documentation/about/faq.md @@ -47,7 +47,9 @@ Post authors with Community Notes on their posts can request additional review o {% /accordionItem %} {% accordionItem title="How do I opt-out of the program?" %} -If you join the program but later want to leave, DM @CommunityNotes to let the team know. +If you join the program but later want to leave, you can do so in your [Community Notes settings](https://x.com/i/communitynotes/notification_settings). + +If you leave and then later want to re-join, you can do so on your [Community Notes alias profile](http://x.com/i/communitynotes/u/me). {% /accordionItem %} {% /accordionSection %} diff --git a/documentation/contributing/notifications.md b/documentation/contributing/notifications.md index 9587c959..47023424 100644 --- a/documentation/contributing/notifications.md +++ b/documentation/contributing/notifications.md @@ -36,3 +36,5 @@ The default setting for all contributors is "Often", which means you'll start by ## Other notifications Contributors also receive notifications with status updates about the notes they've written and rated. At this time these are not configurable, but we plan to add more controls in the future. + +Authors of posts that are showing a note will also receive a notification about the note, once the note has been consistently showing for 6 hours without appearing or disappearing (indicating its status is relatively stable). diff --git a/documentation/contributing/top-contributors.md b/documentation/contributing/top-contributors.md index c51b739f..22a0d4a8 100644 --- a/documentation/contributing/top-contributors.md +++ b/documentation/contributing/top-contributors.md @@ -12,13 +12,12 @@ Top Writers are contributors recognized for writing a significant number of note Top Note Writers get access to: -**Writing notes about media** -Top Writers can write notes about media featured on multiple posts, keeping many more people better informed. [Learn more](./notes-on-media.md). +* **Writing notes on images & videos:** Top Writers can write notes about media featured on multiple posts, keeping many more people better informed. [Learn more](./notes-on-media.md). -**Priority for note alerts** -Top Writers’ note proposals are more likely to trigger notifications to get rater’s attention. +* **Priority for note alerts:** Top Writers’ note proposals are more likely to trigger notifications to get raters’ attention. -**Badge in alias profile** -Top Writers get a badge in their Community Notes profile. +* **Badge in alias profile:** Top Writers get a badge in their Community Notes profile. + +* **See Community Note requests:** Top Writers can see when people on X have requested Community Notes on a post — this can help identify where proposed notes might be found helpful. A contributor’s Top Writer status can always change as their notes are rated by others. 
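The post-author notification rule added to notifications.md above hinges on a simple stability condition: the note must have been showing, with an unchanged status, for 6 hours. Below is a minimal sketch of that check, assuming hypothetical field names (`current_status`, `timestamp_millis_of_current_status`) loosely modeled on the public note status history columns; it is an illustration only, not the production implementation.

```python
import time
from typing import Optional

SIX_HOURS_MS = 6 * 60 * 60 * 1000


def should_notify_post_author(
  current_status: str,
  timestamp_millis_of_current_status: int,
  now_millis: Optional[int] = None,
) -> bool:
  """Hypothetical check: notify the post author only after the note has been
  showing (CURRENTLY_RATED_HELPFUL) with an unchanged status for 6+ hours."""
  if now_millis is None:
    now_millis = int(time.time() * 1000)  # current time in millis since epoch (UTC)
  if current_status != "CURRENTLY_RATED_HELPFUL":
    return False
  # The status must not have flipped within the last 6 hours.
  return now_millis - timestamp_millis_of_current_status >= SIX_HOURS_MS
```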
diff --git a/documentation/images/note-requests.png b/documentation/images/note-requests.png new file mode 100644 index 00000000..af05da94 Binary files /dev/null and b/documentation/images/note-requests.png differ diff --git a/documentation/under-the-hood/download-data.md b/documentation/under-the-hood/download-data.md index 4ee4bef0..37e6ae26 100644 --- a/documentation/under-the-hood/download-data.md +++ b/documentation/under-the-hood/download-data.md @@ -138,9 +138,9 @@ As we iterate and improve Community Notes, we will occasionally make changes to | `noteId` | Long | The unique ID of this note. | | | `participantId` | String | A Community Notes-specific user identifier of the user who authored the rating. This is a permanent id, which remains stable even if the user changes their username/handle. | | | `createdAtMillis` | Long | Time the note was created, in milliseconds since epoch (UTC). | | -| `timestampMillisOfFirstNonNMRStatus` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note got its first status besides “Needs More Ratings”. Empty if the note never left “Needs More Ratings” status. | 1 if “Yes” is selected, 0 if “No” is selected | +| `timestampMillisOfFirstNonNMRStatus` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note got its first status besides “Needs More Ratings”. Empty if the note never left “Needs More Ratings” status. | | `firstNonNMRStatus` | String | The first status the note received when it got a status besides “Needs More Ratings”. Empty if the note never left “Needs More Ratings” status. | "", "CURRENTLY_RATED_HELPFUL", "CURRENTLY_RATED_NOT_HELPFUL" | -| `timestampMillisOfCurrentStatus` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note got its current status, including “Needs More Ratings”. | 1 if “Yes” is selected, 0 if “No” is selected | +| `timestampMillisOfCurrentStatus` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note got its current status, including “Needs More Ratings”. This is equivalent to: when the note was last rescored. | | `currentStatus` | String | The current status of the note. | "NEEDS_MORE_RATINGS", "CURRENTLY_RATED_HELPFUL", "CURRENTLY_RATED_NOT_HELPFUL" | | `timestampMillisOfLatestNonNMRStatus` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note most recently received a status of either “Currently Rated Helpful” or “Currently Rated Not Helpful”. This value will be the same as timestampMillisOfFirstNonNMRStatus if the note has never switched status after receiving its first non-”Needs More Rating” status. Value is empty if the note never left “Needs More Ratings” status. | "NEEDS_MORE_RATINGS", "CURRENTLY_RATED_HELPFUL", "CURRENTLY_RATED_NOT_HELPFUL" | | `latestNonNMRStatus` | String | The latest status the note received, when it got a status besides “Needs More Ratings”. Value is empty if the note never left “Needs More Ratings” status. | "", "CURRENTLY_RATED_HELPFUL", "CURRENTLY_RATED_NOT_HELPFUL" | @@ -152,7 +152,14 @@ As we iterate and improve Community Notes, we will occasionally make changes to | `currentExpansionStatus` | String | The current status, if any, assigned by the expansion submodel. | "", "NEEDS_MORE_RATINGS", "CURRENTLY_RATED_HELPFUL", "CURRENTLY_RATED_NOT_HELPFUL" | | `currentGroupStatus` | String | The current status, if any, assigned by the group submodel. 
| "", "NEEDS_MORE_RATINGS", "CURRENTLY_RATED_HELPFUL", "CURRENTLY_RATED_NOT_HELPFUL" | | `currentDecidedByKey` | String | The submodel whose status was used to determine the note's overall current status. | "CoreModel (v1.1)", "ExpansionModel (v1.1)", "GroupModel01 (v1.1)", "GroupModel02 (v1.1)", ..., "InsufficientExplanation (v1.0)", "ScoringDriftGuard (v1.0)" | -| `currentModelingGroup` | Int | The ID of the modeling group that this note would be scored by, if eligible to be scored by a group model (determined by the modeling groups of its raters, from the user enrollment file). 0 is a placeholder for no modeling group. | 0-13 | +| `currentModelingGroup` | Int | The ID of the modeling group that this note would be scored by, if eligible to be scored by a group model (determined by the modeling groups of its raters, from the user enrollment file). 0 is a placeholder for no modeling group. | nonnegative int | +| `timestampMillisOfMostRecentStatusChange` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note's status was last changed. Value is -1 if the note's status has never changed. | +| `timestampMillisOfNmrDueToMinStableCrhTime` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note first met the scoring criteria to become CRH, but was set to NMR due to the NmrDueToMinStableCRHTime scoring rule. | +| `currentMultiGroupStatus` | String | The current status, if any, assigned by the multi-group submodel. | "", "NEEDS_MORE_RATINGS", "CURRENTLY_RATED_HELPFUL", "CURRENTLY_RATED_NOT_HELPFUL" | +| `currentModelingMultiGroup` | Int | The ID of the multi-modeling group that this note would be scored by, if eligible to be scored by a multi group model (determined by the modeling groups of its raters, from the user enrollment file). 0 is a placeholder for no multi modeling group. | nonnegative int | +| `timestampMinuteOfFinalScoringOutput` | None | For internal use. Timestamp of scoring run. | None | +| `timestampMillisOfFirstNmrDueToMinStableCrhTime` | Long | The timestamp, in milliseconds since epoch (UTC), of when the note first met the necessary scoring rules to become CRH, but was set to a final NMR status in order to wait for the minimum amount of stable time before finally CRHing the note. | + ### Ratings diff --git a/documentation/under-the-hood/helpful-notes.md b/documentation/under-the-hood/helpful-notes.md new file mode 100644 index 00000000..4fcf5d8b --- /dev/null +++ b/documentation/under-the-hood/helpful-notes.md @@ -0,0 +1,25 @@ +--- +title: Follow @HelpfulNotes +description: How to follow Community Notes that people are finding helpful +navWeight: 3 +--- +# Follow @HelpfulNotes + +Want to keep up-to-date with the Community Notes that people are finding helpful? 
These notes have always been [public](download-data.md), and you can follow them on various “Helpful Notes” accounts: + +- Dutch/Nederlands - [@NuttigeOpm](http://x.com/NuttigeOpm) +- English - [@HelpfulNotes](http://x.com/HelpfulNotes) +- French/français - [@NotesUtiles](http://x.com/NotesUtiles) +- German/Deutsch - [@HilfrchAnmerkg](http://x.com/HilfrchAnmerkg) +- Italian/italiano - [@NoteUtili](http://x.com/NoteUtili) +- Japanese/日本語 - [@HelpfulNotesJP](http://x.com/HelpfulNotesJP) +- Portuguese/português - [@NotasUteis](http://x.com/NotasUteis) +- Spanish/español - [@NotasUtilesES](http://x.com/NotasUtilesES) + +### How it works + +These accounts automatically repost Community Notes that meet the following criteria: +- Have been rated Helpful by contributors +- Have kept their status of Helpful for 6hr+ +- Have a high helpfulness score (0.45+ intercept score) +- If the note contains terms that contributors have reported as sometimes overwhelming the Helpful Notes timelines ("spam", "scam", "dropship", "drop ship", "promotion") it will have a lower probability of appearing, so as to avoid overwhelming the timeline. This is experimental to determine if it improves follower satisfaction. diff --git a/documentation/under-the-hood/note-requests.md b/documentation/under-the-hood/note-requests.md new file mode 100644 index 00000000..f661953b --- /dev/null +++ b/documentation/under-the-hood/note-requests.md @@ -0,0 +1,28 @@ +--- +title: Request a Note +description: How the Request Community Note feature works +navWeight: 3 +--- +# Request a Community Note + +People on X can request a Community Note on a post they believe would benefit from one. If there are enough requests, Community Notes contributors will see an alert, and can choose to propose a note. This gives people on X who are not contributors a way to help, and lets contributors know where notes might be found helpful. + +![Button to request a note, and banner showing what a contributor sees when there are requests on a post](../images/note-requests.png) + +### Requesting a note + +- Tap the ••• menu on a post, then tap **Request Community Note** +- To be eligible to request a note, accounts must have a **verified phone number** +- Initially, accounts can make up to 5 requests per day. The limit may increase if requests successfully result in helpful notes, or may decrease if requests are on posts that people don’t agree need a note. This helps prevent spam and keep note writers focused on posts that could use helpful notes. +- *For Community Notes contributors:* During the note request pilot, we are evaluating whether it is beneficial for Community Notes contributors to have both the ability to write notes and request notes. So, initially, 50% of contributors can both write and request, and 50% can solely write. + +### Contributors seeing requests + +Contributors who have earned [Top Writer status](../contributing/top-contributors.md) will see that there are requests on a post, if there are enough requests. They can also see a timeline on posts with note requests in their [Community Notes tab](https://x.com/i/communitynotes). 
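As a rough illustration of the pilot visibility threshold listed just below (requests surface once their count reaches MAX(5, number of views / 25000)), here is a minimal sketch. The function name and parameters are hypothetical; this is not the production logic.

```python
def note_requests_visible(num_requests: int, num_views: int) -> bool:
  """Hypothetical sketch of the pilot criterion below: show requests to Top
  Writers once the request count reaches MAX(5, views / 25000)."""
  threshold = max(5, num_views / 25000)
  return num_requests >= threshold


# A post with 1,000,000 views needs at least 40 requests; a low-view post needs 5.
assert note_requests_visible(num_requests=40, num_views=1_000_000)
assert not note_requests_visible(num_requests=4, num_views=10_000)
```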
+ +During the note request pilot: +- Requests will show on a post if the number of requests on the post is greater than or equal to MAX(5, number of views on post / 25000) +- Requests will show for 24 hours +- For a post to show up in the Note Requests timeline, the post must be recent (less than 24 hours old) + +We expect these criteria to evolve, with the goal that requests are frequently found valuable to contributors, and not noisy. The criteria are initially simple during this pilot phase. diff --git a/documentation/under-the-hood/ranking-notes.md b/documentation/under-the-hood/ranking-notes.md index d02512a7..b8fc72cc 100644 --- a/documentation/under-the-hood/ranking-notes.md +++ b/documentation/under-the-hood/ranking-notes.md @@ -305,7 +305,7 @@ For not-helpful notes: ### Prescoring -1. Pre-filter the data: to address sparsity issues, only raters with at least 10 ratings and notes with at least 5 ratings are included (although we don’t recursively filter until convergence). +1. Pre-filter the data: to address sparsity issues, only raters with at least 10 ratings and notes with at least 5 ratings are included (although we don’t recursively filter until convergence). Also, coalesce ratings made by raters with high post-selection-similarity. 2. For each scorer (Core, Expansion, ExpansionPlus, and multiple Group and Topic scorers): - Fit matrix factorization model, then assign intermediate note status labels for notes whose intercept terms (scores) are above or below thresholds. - Compute Author and Rater Helpfulness Scores based on the results of the first matrix factorization, then filter out raters with low helpfulness scores from the ratings data as described in [Filtering Ratings Based on Helpfulness Scores](./contributor-scores.md). @@ -324,6 +324,24 @@ For not-helpful notes: ## What’s New? +**Sep 17, 2024** +- Lower threshold for coalescing ratings with high post-selection-similarity. + +**Aug 21, 2024** + +**Aug 12, 2024** +- Add a 30min delay for notes that meet the CRH criteria ("NMRDueToStableCRHTime") to ensure they stably meet that criteria across multiple scoring runs before CRHing them +- Add multi-group models + +**July 25, 2024** +- Only score a subset of notes each time we run final note scoring. This doesn't affect what statuses new notes get, but does cause them to get scored more quickly. + +**May 31, 2024** +- Coalesce ratings on the same note from raters with very high post-selection-similarity. + +**May 1, 2024** +- Modify scoring thresholds for the Expanded Consensus trial to raise the note intercept threshold for Helpful notes, increase tag filtering and require a minimum Core or Expansion intercept for Helpful notes. + **April 11, 2024** - Split the scorer into separate prescoring and final scoring phases. diff --git a/documentation/under-the-hood/timeline-tabs.md b/documentation/under-the-hood/timeline-tabs.md index 9010d429..79b6c716 100644 --- a/documentation/under-the-hood/timeline-tabs.md +++ b/documentation/under-the-hood/timeline-tabs.md @@ -47,6 +47,7 @@ A variety of criteria are considered when determining whether to include a note, - Current status (e.g. "Needs More Ratings") - Current helpfulness score (e.g. 
not a low helpfulness score, highly rated by initial raters, possibly nearing the threshold to earn status of "Helpful") - Does not have a large number of ratings (such that more ratings could change the note's status) +- If the note contains terms that contributors have reported as sometimes overwhelming the Needs Your Help tab ("spam", "scam", "dropship", "drop ship", "promotion") it will have a lower probability of appearing, so as to avoid overwhelming the tab. This is experimental to determine if it improves contributor satisfaction. In this case of alerts, notifications are sent to a random selection of contributors, excluding the note author and those who have already rated the note. Notifications are also limited by the recipient's notification frequency setting. diff --git a/output/prescoring_meta_output.joblib b/output/prescoring_meta_output.joblib new file mode 100644 index 00000000..b99cd2a7 Binary files /dev/null and b/output/prescoring_meta_output.joblib differ diff --git a/output/prescoring_note_topic_classifier.joblib b/output/prescoring_note_topic_classifier.joblib new file mode 100644 index 00000000..040c01fb Binary files /dev/null and b/output/prescoring_note_topic_classifier.joblib differ diff --git a/requirements.txt b/requirements.txt index 206434e2..f5e547f9 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,6 @@ -numpy==1.26.2 -pandas==2.1.4 +numpy==1.26.4 +pandas==2.2.2 torch==2.1.2 scipy==1.11.4 -scikit-learn>=1.3.0 +scikit-learn==1.3.0 pyarrow diff --git a/sourcecode/scoring/constants.py b/sourcecode/scoring/constants.py index 93d16add..cd770bc3 100644 --- a/sourcecode/scoring/constants.py +++ b/sourcecode/scoring/constants.py @@ -1,13 +1,19 @@ from contextlib import contextmanager from dataclasses import dataclass +from enum import Enum +import logging import os import time -from typing import Optional +from typing import Dict, Optional, Set import numpy as np import pandas as pd +logger = logging.getLogger("birdwatch.constants") +logger.setLevel(logging.INFO) + + # Default number of threads to use in torch if os.cpu_count() is unavailable # and no value is specified. 
defaultNumThreads = os.cpu_count() or 8 @@ -32,6 +38,19 @@ tagPercentileForNormalization = 40 intervalHalfWidth = 0.3 +# Max flip rates +prescoringAllUnlockedNotesMaxCrhChurn = 0.2 +prescoringAllNotesCreatedThreeToThirteenDaysAgoMaxChurn = 0.06 +finalUnlockedNotesWithNoNewRatingsMaxCrhChurn = 0.05 +finalNotesWithNewRatingsMaxNewCrhChurn = 0.80 +finalNotesWithNewRatingsMaxOldCrhChurn = 0.25 +finalNotesThatJustFlippedStatusMaxCrhChurn = 1e8 +finalNotesThatFlippedRecentlyMaxCrhChurn = 1e8 +# TODO(jiansongc): adjust these 2 below +finalNotesNmrDueToMinStableCrhTimeMaxOldCrhChurn = 1.0 +finalNotesNmrDueToMinStableCrhTimeMaxNewCrhChurn = 1.0 + + # Data Filenames scoredNotesOutputPath = "scoredNotes.tsv" enrollmentInputPath = "userEnrollment-00000.tsv" @@ -50,7 +69,14 @@ authorTopNotHelpfulTagValues = "authorTopNotHelpfulTagValues" modelingPopulationKey = "modelingPopulation" modelingGroupKey = "modelingGroup" +modelingMultiGroupKey = "modelingMultiGroup" numberOfTimesEarnedOutKey = "numberOfTimesEarnedOut" +defaultIndexKey = "index" + +# Scoring Groups +coreGroups: Set[int] = {1, 2, 3, 6, 8, 9, 10, 11, 13, 14, 19, 21, 25} +expansionGroups: Set[int] = {0, 4, 5, 7, 12, 16, 18, 20, 22, 23, 24, 26, 27, 28} +expansionPlusGroups: Set[int] = {15, 17, 29, 30} # TSV Values notHelpfulValueTsv = "NOT_HELPFUL" @@ -95,6 +121,15 @@ unlockedRatingStatusKey = "unlockedRatingStatus" metaScorerActiveRulesKey = "metaScorerActiveRules" decidedByKey = "decidedBy" +rescoringActiveRulesKey = "rescoringActiveRules" + +# Note Status Changes Columns +noteFinalStatusChange = "finalStatusChange" +noteNewRatings = "newRatings" +noteDecidedByChange = "decidedByChange" +noteAllAddedRules = "allAddedRules" +noteAllRemovedRules = "allRemovedRules" +noteDecidedByInterceptChange = "decidedByInterceptChange" # Internal Scoring Columns. These columns should be renamed before writing to disk. 
internalNoteInterceptKey = "internalNoteIntercept" @@ -103,6 +138,7 @@ internalRaterFactorKeyBase = "internalRaterFactor" internalRatingStatusKey = "internalRatingStatus" internalActiveRulesKey = "internalActiveRules" +internalRaterReputationKey = "internalRaterReputation" scorerNameKey = "scorerName" @@ -128,6 +164,7 @@ def rater_factor_key(i): coreActiveRulesKey = "coreActiveRules" coreNoteInterceptMaxKey = "coreNoteInterceptMax" coreNoteInterceptMinKey = "coreNoteInterceptMin" +coreNumFinalRoundRatingsKey = "coreNumFinalRoundRatings" # Expansion Model expansionNoteInterceptKey = "expansionNoteIntercept" expansionNoteFactor1Key = "expansionNoteFactor1" @@ -135,11 +172,17 @@ def rater_factor_key(i): expansionNoteInterceptMaxKey = "expansionNoteInterceptMax" expansionNoteInterceptMinKey = "expansionNoteInterceptMin" expansionInternalActiveRulesKey = "expansionActiveRules" +expansionNumFinalRoundRatingsKey = "expansionNumFinalRoundRatings" +expansionRaterFactor1Key = "expansionRaterFactor1" +expansionRaterInterceptKey = "expansionRaterIntercept" # ExpansionPlus Model expansionPlusNoteInterceptKey = "expansionPlusNoteIntercept" expansionPlusNoteFactor1Key = "expansionPlusNoteFactor1" expansionPlusRatingStatusKey = "expansionPlusRatingStatus" expansionPlusInternalActiveRulesKey = "expansionPlusActiveRules" +expansionPlusNumFinalRoundRatingsKey = "expansionPlusNumFinalRoundRatings" +expansionPlusRaterFactor1Key = "expansionPlusRaterFactor1" +expansionPlusRaterInterceptKey = "expansionPlusRaterIntercept" # Coverage / Helpfulness Reputation Model coverageNoteInterceptKey = "coverageNoteIntercept" coverageNoteFactor1Key = "coverageNoteFactor1" @@ -156,19 +199,28 @@ def rater_factor_key(i): groupRaterInterceptKey = "groupRaterIntercept" groupRaterFactor1Key = "groupRaterFactor1" groupInternalActiveRulesKey = "groupActiveRules" +groupNumFinalRoundRatingsKey = "groupNumFinalRoundRatings" +# MultiGroup Model +multiGroupNoteInterceptKey = "multiGroupNoteIntercept" +multiGroupNoteFactor1Key = "multiGroupNoteFactor1" +multiGroupRatingStatusKey = "multiGroupRatingStatus" +multiGroupRaterInterceptKey = "multiGroupRaterIntercept" +multiGroupRaterFactor1Key = "multiGroupRaterFactor1" +multiGroupInternalActiveRulesKey = "multiGroupActiveRules" +multiGroupNumFinalRoundRatingsKey = "multiGroupNumFinalRoundRatings" # Topic Model topicNoteInterceptKey = "topicNoteIntercept" topicNoteFactor1Key = "topicNoteFactor1" topicRatingStatusKey = "topicRatingStatus" topicNoteConfidentKey = "topicNoteConfident" topicInternalActiveRulesKey = "topicActiveRules" +topicNumFinalRoundRatingsKey = "topicNumFinalRoundRatings" # Harassment/Abuse Tag harassmentNoteInterceptKey = "harassmentNoteIntercept" harassmentNoteFactor1Key = "harassmentNoteFactor1" harassmentRaterInterceptKey = "harassmentRaterIntercept" harassmentRaterFactor1Key = "harassmentRaterFactor1" - # Ids and Indexes noteIdKey = "noteId" tweetIdKey = "tweetId" @@ -183,6 +235,7 @@ def rater_factor_key(i): numRatingsLast28DaysKey = "numRatingsLast28" ratingFromInitialModelingGroupKey = "ratingFromInitialModelingGroup" percentFromInitialModelingGroupKey = "percentFromInitialModelingGroup" +numFinalRoundRatingsKey = "numFinalRoundRatings" # Helpfulness Score Keys crhRatioKey = "CRHRatio" @@ -199,6 +252,9 @@ def rater_factor_key(i): currentlyRatedHelpful = "CURRENTLY_RATED_HELPFUL" currentlyRatedNotHelpful = "CURRENTLY_RATED_NOT_HELPFUL" needsMoreRatings = "NEEDS_MORE_RATINGS" +# FIRM_REJECT is set by individual scorers to indicate downstream scorers should not CRH +# a 
note, but is never set as the finalRatingStatus of a note. +firmReject = "FIRM_REJECT" # Boolean Note Status Labels currentlyRatedHelpfulBoolKey = "crhBool" @@ -219,8 +275,10 @@ def rater_factor_key(i): (1, "helpfulUnbiasedLanguage"), ] helpfulTagsTSVOrder = [tag for (tiebreakOrder, tag) in helpfulTagsAndTieBreakOrder] -helpfulTagsAndTypesTSVOrder = [(tag, np.int64) for tag in helpfulTagsTSVOrder] +helpfulTagBoolsAndTypesTSVOrder = [(tag, pd.Int8Dtype()) for tag in helpfulTagsTSVOrder] helpfulTagsTiebreakOrder = [tag for (tiebreakOrder, tag) in sorted(helpfulTagsAndTieBreakOrder)] +helpfulTagCountsAndTypesTSVOrder = [(tag, pd.Int64Dtype()) for tag in helpfulTagsTSVOrder] + # NOTE: Always add new tags to the end of this list, and *never* change the order of # elements which are already in the list to maintain compatibility with @@ -257,7 +315,8 @@ def rater_factor_key(i): (6, notHelpfulNoteNotNeededKey), ] notHelpfulTagsTSVOrder = [tag for (tiebreakOrder, tag) in notHelpfulTagsAndTieBreakOrder] -notHelpfulTagsAndTypesTSVOrder = [(tag, np.int64) for tag in notHelpfulTagsTSVOrder] +notHelpfulTagsAndTypesTSVOrder = [(tag, pd.Int8Dtype()) for tag in notHelpfulTagsTSVOrder] +notHelpfulTagCountsAndTypesTSVOrder = [(tag, pd.Int64Dtype()) for tag in notHelpfulTagsTSVOrder] notHelpfulTagsTiebreakOrder = [ tag for (tiebreakOrder, tag) in sorted(notHelpfulTagsAndTieBreakOrder) ] @@ -269,20 +328,58 @@ def rater_factor_key(i): } adjustedSuffix = "Adjusted" notHelpfulTagsAdjustedColumns = [f"{column}{adjustedSuffix}" for column in notHelpfulTagsTSVOrder] +notHelpfulTagsAdjustedTSVColumnsAndTypes = [ + (tag, np.double) for tag in notHelpfulTagsAdjustedColumns +] ratioSuffix = "Ratio" notHelpfulTagsAdjustedRatioColumns = [ f"{column}{ratioSuffix}" for column in notHelpfulTagsAdjustedColumns ] +notHelpfulTagsAdjustedRatioTSVColumnsAndTypes = [ + (tag, np.double) for tag in notHelpfulTagsAdjustedRatioColumns +] ratingWeightKey = "ratingWeight" +incorrectTagRatingsMadeByRaterKey = "incorrectTagRatingsMadeByRater" +totalRatingsMadeByRaterKey = "totalRatingsMadeByRater" + +noteTfIdfIncorrectScoreKey = "tf_idf_incorrect" +numVotersKey = "num_voters" # num voters who rated a note +incorrectTagRateByRaterKey = "p_incorrect_user" + +noteTfIdfIncorrectScoreIntervalKey = ( + "tf_idf_incorrect_interval" # note's tf-idf scores from within the interval +) +numVotersIntervalKey = "num_voters_interval" # num voters (in the interval) who rated a note +sumOfIncorrectTagRateByRaterIntervalKey = ( + "p_incorrect_user_interval" +) # sum of p_incorrect_user for all raters who rated a note in the interval +notHelpfulIncorrectIntervalKey = ( + "notHelpfulIncorrect_interval" # notHelpfulIncorrect ratings on the note in the interval +) + lowDiligenceInterceptKey = "lowDiligenceIntercept" -incorrectFilterColumns = [ - "notHelpfulIncorrect_interval", - "p_incorrect_user_interval", - "num_voters_interval", - "tf_idf_incorrect_interval", - lowDiligenceInterceptKey, + + +lowDiligenceRaterFactor1Key = "lowDiligenceRaterFactor1" +lowDiligenceRaterInterceptKey = "lowDiligenceRaterIntercept" +lowDiligenceRaterReputationKey = "lowDiligenceRaterReputation" +lowDiligenceNoteFactor1Key = "lowDiligenceNoteFactor1" +lowDiligenceNoteInterceptKey = "lowDiligenceNoteIntercept" +lowDiligenceLegacyNoteInterceptKey = "lowDiligenceIntercept" +lowDiligenceNoteInterceptRound2Key = "lowDiligenceNoteInterceptRound2" +internalNoteInterceptRound2Key = "internalNoteInterceptRound2" +lowDiligenceRaterInterceptRound2Key = "lowDiligenceRaterInterceptRound2" 
+internalRaterInterceptRound2Key = "internalRaterInterceptRound2" + +incorrectFilterColumnsAndTypes = [ + (notHelpfulIncorrectIntervalKey, np.double), + (sumOfIncorrectTagRateByRaterIntervalKey, np.double), + (numVotersIntervalKey, np.double), + (noteTfIdfIncorrectScoreIntervalKey, np.double), + (lowDiligenceLegacyNoteInterceptKey, np.double), ] +incorrectFilterColumns = [col for (col, _) in incorrectFilterColumnsAndTypes] misleadingTags = [ "misleadingOther", @@ -293,7 +390,7 @@ def rater_factor_key(i): "misleadingUnverifiedClaimAsFact", "misleadingSatire", ] -misleadingTagsAndTypes = [(tag, np.int64) for tag in misleadingTags] +misleadingTagsAndTypes = [(tag, pd.Int8Dtype()) for tag in misleadingTags] notMisleadingTags = [ "notMisleadingOther", @@ -302,8 +399,7 @@ def rater_factor_key(i): "notMisleadingClearlySatire", "notMisleadingPersonalOpinion", ] -notMisleadingTagsAndTypes = [(tag, np.int64) for tag in notMisleadingTags] - +notMisleadingTagsAndTypes = [(tag, pd.Int8Dtype()) for tag in notMisleadingTags] noteTSVColumnsAndTypes = ( [ @@ -312,13 +408,13 @@ def rater_factor_key(i): (createdAtMillisKey, np.int64), (tweetIdKey, np.int64), (classificationKey, object), - ("believable", object), - ("harmful", object), - ("validationDifficulty", object), + ("believable", "category"), + ("harmful", "category"), + ("validationDifficulty", "category"), ] + misleadingTagsAndTypes + notMisleadingTagsAndTypes - + [("trustworthySources", np.int64), (summaryKey, object), ("isMediaNote", np.int64)] + + [("trustworthySources", pd.Int8Dtype()), (summaryKey, object), ("isMediaNote", pd.Int8Dtype())] ) noteTSVColumns = [col for (col, dtype) in noteTSVColumnsAndTypes] noteTSVTypes = [dtype for (col, dtype) in noteTSVColumnsAndTypes] @@ -333,14 +429,14 @@ def rater_factor_key(i): (noteIdKey, np.int64), (raterParticipantIdKey, object), (createdAtMillisKey, np.int64), - (versionKey, np.int64), - (agreeKey, np.int64), - (disagreeKey, np.int64), - (helpfulKey, np.int64), - (notHelpfulKey, np.int64), - (helpfulnessLevelKey, object), + (versionKey, pd.Int8Dtype()), + (agreeKey, pd.Int8Dtype()), + (disagreeKey, pd.Int8Dtype()), + (helpfulKey, pd.Int8Dtype()), + (notHelpfulKey, pd.Int8Dtype()), + (helpfulnessLevelKey, "category"), ] - + helpfulTagsAndTypesTSVOrder + + helpfulTagBoolsAndTypesTSVOrder + notHelpfulTagsAndTypesTSVOrder + [(ratedOnTweetIdKey, np.int64)] ) @@ -349,7 +445,6 @@ def rater_factor_key(i): ratingTSVTypes = [dtype for (col, dtype) in ratingTSVColumnsAndTypes] ratingTSVTypeMapping = {col: dtype for (col, dtype) in ratingTSVColumnsAndTypes} - timestampMillisOfNoteFirstNonNMRLabelKey = "timestampMillisOfFirstNonNMRStatus" firstNonNMRLabelKey = "firstNonNMRStatus" timestampMillisOfNoteCurrentLabelKey = "timestampMillisOfCurrentStatus" @@ -364,31 +459,52 @@ def rater_factor_key(i): currentGroupStatusKey = "currentGroupStatus" currentDecidedByKey = "currentDecidedBy" currentModelingGroupKey = "currentModelingGroup" +timestampMillisOfMostRecentStatusChangeKey = "timestampMillisOfMostRecentStatusChange" +currentMultiGroupStatusKey = "currentMultiGroupStatus" +currentModelingMultiGroupKey = "currentModelingMultiGroup" +timestampMillisOfNmrDueToMinStableCrhTimeKey = "timestampMillisOfNmrDueToMinStableCrhTime" +updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey = ( + "updatedTimestampMillisOfNmrDueToMinStableCrhTime" +) +timestampMinuteOfFinalScoringOutput = "timestampMinuteOfFinalScoringOutput" +timestampMillisOfFirstNmrDueToMinStableCrhTimeKey = "timestampMillisOfFirstNmrDueToMinStableCrhTime" 
noteStatusHistoryTSVColumnsAndTypes = [ (noteIdKey, np.int64), (noteAuthorParticipantIdKey, object), (createdAtMillisKey, np.int64), (timestampMillisOfNoteFirstNonNMRLabelKey, np.double), # double because nullable. - (firstNonNMRLabelKey, object), + (firstNonNMRLabelKey, "category"), (timestampMillisOfNoteCurrentLabelKey, np.double), # double because nullable. - (currentLabelKey, object), + (currentLabelKey, "category"), (timestampMillisOfNoteMostRecentNonNMRLabelKey, np.double), # double because nullable. - (mostRecentNonNMRLabelKey, object), + (mostRecentNonNMRLabelKey, "category"), (timestampMillisOfStatusLockKey, np.double), # double because nullable. - (lockedStatusKey, object), + (lockedStatusKey, "category"), (timestampMillisOfRetroLockKey, np.double), # double because nullable. - (currentCoreStatusKey, object), - (currentExpansionStatusKey, object), - (currentGroupStatusKey, object), - (currentDecidedByKey, object), - (currentModelingGroupKey, object), + (currentCoreStatusKey, "category"), + (currentExpansionStatusKey, "category"), + (currentGroupStatusKey, "category"), + (currentDecidedByKey, "category"), + (currentModelingGroupKey, np.double), # TODO: int + (timestampMillisOfMostRecentStatusChangeKey, np.double), # double because nullable. + (timestampMillisOfNmrDueToMinStableCrhTimeKey, np.double), # double because nullable. + (currentMultiGroupStatusKey, "category"), + (currentModelingMultiGroupKey, np.double), # TODO: int + (timestampMinuteOfFinalScoringOutput, np.double), # double because nullable. + (timestampMillisOfFirstNmrDueToMinStableCrhTimeKey, np.double), # double because nullable. ] noteStatusHistoryTSVColumns = [col for (col, dtype) in noteStatusHistoryTSVColumnsAndTypes] noteStatusHistoryTSVTypes = [dtype for (col, dtype) in noteStatusHistoryTSVColumnsAndTypes] noteStatusHistoryTSVTypeMapping = { col: dtype for (col, dtype) in noteStatusHistoryTSVColumnsAndTypes } +# TODO(jiansongc): clean up after new column is in production. +noteStatusHistoryTSVColumnsOld = noteStatusHistoryTSVColumns[:-1] +noteStatusHistoryTSVColumnsAndTypesOld = noteStatusHistoryTSVColumnsAndTypes[:-1] +noteStatusHistoryTSVTypeMappingOld = { + col: dtype for (col, dtype) in noteStatusHistoryTSVColumnsAndTypesOld +} # Earn In + Earn Out @@ -404,6 +520,7 @@ def rater_factor_key(i): earnedOutNoAcknowledge = "earnedOutNoAcknowledge" earnedOutAcknowledged = "earnedOutAcknowledged" newUser = "newUser" +removed = "removed" isAtRiskCRNHCount = 2 ratingImpactForEarnIn = 5 ratingImpact = "ratingImpact" @@ -413,6 +530,7 @@ def rater_factor_key(i): earnedOutNoAcknowledge: 2, earnedOutAcknowledged: 3, newUser: 4, + removed: 5, } emergingWriterDays = 28 isEmergingWriterKey = "isEmergingWriter" @@ -432,7 +550,7 @@ def rater_factor_key(i): (successfulRatingNeededToEarnIn, np.int64), (timestampOfLastStateChange, np.int64), (timestampOfLastEarnOut, np.double), # double because nullable. 
- (modelingPopulationKey, str), + (modelingPopulationKey, "category"), (modelingGroupKey, np.float64), (numberOfTimesEarnedOutKey, np.int64), ] @@ -476,30 +594,36 @@ def rater_factor_key(i): col: dtype for (col, dtype) in noteParameterUncertaintyTSVColumnsAndTypes } -auxiliaryScoredNotesTSVColumns = ( +auxiliaryScoredNotesTSVColumnsAndTypes = ( [ - noteIdKey, - ratingWeightKey, - createdAtMillisKey, - noteAuthorParticipantIdKey, - awaitingMoreRatingsBoolKey, - numRatingsLast28DaysKey, - currentLabelKey, - currentlyRatedHelpfulBoolKey, - currentlyRatedNotHelpfulBoolKey, - unlockedRatingStatusKey, + (noteIdKey, np.int64), + (ratingWeightKey, np.double), + (createdAtMillisKey, np.int64), + (noteAuthorParticipantIdKey, object), + (awaitingMoreRatingsBoolKey, np.int8), + (numRatingsLast28DaysKey, np.int64), + (currentLabelKey, str), + (currentlyRatedHelpfulBoolKey, np.int8), + (currentlyRatedNotHelpfulBoolKey, np.int8), + (unlockedRatingStatusKey, str), ] - + helpfulTagsTSVOrder - + notHelpfulTagsTSVOrder - + notHelpfulTagsAdjustedColumns - + notHelpfulTagsAdjustedRatioColumns - + incorrectFilterColumns + + helpfulTagCountsAndTypesTSVOrder + + notHelpfulTagCountsAndTypesTSVOrder + + notHelpfulTagsAdjustedTSVColumnsAndTypes + + notHelpfulTagsAdjustedRatioTSVColumnsAndTypes + + incorrectFilterColumnsAndTypes ) +auxiliaryScoredNotesTSVColumns = [col for (col, dtype) in auxiliaryScoredNotesTSVColumnsAndTypes] +auxiliaryScoredNotesTSVTypeMapping = { + col: dtype for (col, dtype) in auxiliaryScoredNotesTSVColumnsAndTypes +} deprecatedNoteModelOutputColumns = frozenset( { coverageNoteInterceptMinKey, coverageNoteInterceptMaxKey, + groupNoteInterceptMinKey, + groupNoteInterceptMaxKey, } ) @@ -508,6 +632,9 @@ def rater_factor_key(i): (internalNoteInterceptKey, np.double), (internalNoteFactor1Key, np.double), (scorerNameKey, str), + (lowDiligenceNoteInterceptKey, np.double), + (lowDiligenceNoteFactor1Key, np.double), + (lowDiligenceNoteInterceptRound2Key, np.double), ] prescoringNoteModelOutputTSVColumns = [ col for (col, dtype) in prescoringNoteModelOutputTSVColumnsAndTypes @@ -520,52 +647,64 @@ def rater_factor_key(i): (noteIdKey, np.int64), (coreNoteInterceptKey, np.double), (coreNoteFactor1Key, np.double), - (finalRatingStatusKey, str), - (firstTagKey, str), - (secondTagKey, str), + (finalRatingStatusKey, "category"), + (firstTagKey, "category"), + (secondTagKey, "category"), # Note that this column was formerly named "activeRules" and the name is now # updated to "coreActiveRules". The data values remain the compatible, # but the new column only contains rules that ran when deciding status based on # the core model. 
- (coreActiveRulesKey, str), - (activeFilterTagsKey, str), - (classificationKey, str), + (coreActiveRulesKey, "category"), + (activeFilterTagsKey, "category"), + (classificationKey, "category"), (createdAtMillisKey, np.int64), - (coreRatingStatusKey, str), - (metaScorerActiveRulesKey, str), - (decidedByKey, str), + (coreRatingStatusKey, "category"), + (metaScorerActiveRulesKey, "category"), + (decidedByKey, "category"), (expansionNoteInterceptKey, np.double), (expansionNoteFactor1Key, np.double), - (expansionRatingStatusKey, str), + (expansionRatingStatusKey, "category"), (coverageNoteInterceptKey, np.double), (coverageNoteFactor1Key, np.double), - (coverageRatingStatusKey, str), + (coverageRatingStatusKey, "category"), (coreNoteInterceptMinKey, np.double), (coreNoteInterceptMaxKey, np.double), - (expansionNoteInterceptMinKey, np.double), - (expansionNoteInterceptMaxKey, np.double), - (coverageNoteInterceptMinKey, np.double), - (coverageNoteInterceptMaxKey, np.double), + (expansionNoteInterceptMinKey, "category"), # category because always nan + (expansionNoteInterceptMaxKey, "category"), # category because always nan + (coverageNoteInterceptMinKey, "category"), # category because always nan + (coverageNoteInterceptMaxKey, "category"), # category because always nan (groupNoteInterceptKey, np.double), (groupNoteFactor1Key, np.double), - (groupRatingStatusKey, str), - (groupNoteInterceptMaxKey, np.double), - (groupNoteInterceptMinKey, np.double), + (groupRatingStatusKey, "category"), + (groupNoteInterceptMaxKey, "category"), # category because always nan + (groupNoteInterceptMinKey, "category"), # category because always nan (modelingGroupKey, np.float64), (numRatingsKey, np.int64), (timestampMillisOfNoteCurrentLabelKey, np.double), (expansionPlusNoteInterceptKey, np.double), (expansionPlusNoteFactor1Key, np.double), - (expansionPlusRatingStatusKey, str), + (expansionPlusRatingStatusKey, "category"), (topicNoteInterceptKey, np.double), (topicNoteFactor1Key, np.double), - (topicRatingStatusKey, str), - (noteTopicKey, str), - (topicNoteConfidentKey, str), - (expansionInternalActiveRulesKey, str), - (expansionPlusInternalActiveRulesKey, str), - (groupInternalActiveRulesKey, str), - (topicInternalActiveRulesKey, str), + (topicRatingStatusKey, "category"), + (noteTopicKey, "category"), + (topicNoteConfidentKey, pd.BooleanDtype()), + (expansionInternalActiveRulesKey, "category"), + (expansionPlusInternalActiveRulesKey, "category"), + (groupInternalActiveRulesKey, "category"), + (topicInternalActiveRulesKey, "category"), + (coreNumFinalRoundRatingsKey, np.double), # double because nullable. + (expansionNumFinalRoundRatingsKey, np.double), # double because nullable. + (expansionPlusNumFinalRoundRatingsKey, np.double), # double because nullable. + (groupNumFinalRoundRatingsKey, np.double), # double because nullable. + (topicNumFinalRoundRatingsKey, np.double), # double because nullable. + (rescoringActiveRulesKey, "category"), + (multiGroupNoteInterceptKey, np.double), + (multiGroupNoteFactor1Key, np.double), + (multiGroupRatingStatusKey, str), + (modelingMultiGroupKey, np.float64), + (multiGroupInternalActiveRulesKey, str), + (multiGroupNumFinalRoundRatingsKey, np.double), # double because nullable. 
] noteModelOutputTSVColumns = [col for (col, dtype) in noteModelOutputTSVColumnsAndTypes] noteModelOutputTSVTypeMapping = {col: dtype for (col, dtype) in noteModelOutputTSVColumnsAndTypes} @@ -575,6 +714,8 @@ def rater_factor_key(i): if col in deprecatedNoteModelOutputColumns ] +postSelectionValueKey = "postSelectionValue" + prescoringRaterModelOutputTSVColumnsAndTypes = [ (raterParticipantIdKey, object), (internalRaterInterceptKey, np.double), @@ -582,11 +723,16 @@ def rater_factor_key(i): (crhCrnhRatioDifferenceKey, np.double), (meanNoteScoreKey, np.double), (raterAgreeRatioKey, np.double), - ( - aboveHelpfulnessThresholdKey, - "boolean", - ), # nullable bool https://pandas.pydata.org/docs/user_guide/boolean.html + (aboveHelpfulnessThresholdKey, pd.BooleanDtype()), (scorerNameKey, str), + (internalRaterReputationKey, np.double), + (lowDiligenceRaterInterceptKey, np.double), + (lowDiligenceRaterFactor1Key, np.double), + (lowDiligenceRaterReputationKey, np.double), + (lowDiligenceRaterInterceptRound2Key, np.double), + (incorrectTagRatingsMadeByRaterKey, pd.Int64Dtype()), + (totalRatingsMadeByRaterKey, pd.Int64Dtype()), + (postSelectionValueKey, pd.Int64Dtype()), ] prescoringRaterModelOutputTSVColumns = [ col for (col, dtype) in prescoringRaterModelOutputTSVColumnsAndTypes @@ -617,7 +763,7 @@ def rater_factor_key(i): (successfulRatingNeededToEarnIn, pd.Int64Dtype()), (authorTopNotHelpfulTagValues, str), (timestampOfLastStateChange, np.double), - (aboveHelpfulnessThresholdKey, np.float64), # nullable bool + (aboveHelpfulnessThresholdKey, np.float64), # nullable bool. (isEmergingWriterKey, pd.BooleanDtype()), (aggregateRatingReceivedTotal, pd.Int64Dtype()), (timestampOfLastEarnOut, np.double), @@ -626,10 +772,61 @@ def rater_factor_key(i): (modelingGroupKey, np.float64), (raterHelpfulnessReputationKey, np.double), (numberOfTimesEarnedOutKey, np.float64), + (expansionRaterInterceptKey, np.double), + (expansionRaterFactor1Key, np.double), + (expansionPlusRaterInterceptKey, np.double), + (expansionPlusRaterFactor1Key, np.double), + (multiGroupRaterInterceptKey, np.double), + (multiGroupRaterFactor1Key, np.double), + (modelingMultiGroupKey, np.float64), ] raterModelOutputTSVColumns = [col for (col, dtype) in raterModelOutputTSVColumnsAndTypes] raterModelOutputTSVTypeMapping = {col: dtype for (col, dtype) in raterModelOutputTSVColumnsAndTypes} +noteStatusChangesPrev = "_prev" +noteStatusChangesDerivedColumnsAndTypes = [ + (noteIdKey, np.int64), + (noteFinalStatusChange, str), + (noteNewRatings, np.int64), + (noteDecidedByChange, str), + (noteAllAddedRules, str), + (noteAllRemovedRules, str), + (noteDecidedByInterceptChange, str), +] +noteStatusChangesRemovedCols = [ + col + for col in noteModelOutputTSVColumns + if ("NoteInterceptMin" in col) or ("NoteInterceptMax" in col) +] +noteStatusChangesModelOutputColumnsAndTypes = [ + (col, t) + for (col, t) in noteModelOutputTSVColumnsAndTypes + if col not in noteStatusChangesRemovedCols + [noteIdKey] +] +noteStatusChangesModelOutputWithPreviousColumnsAndTypes = ( + noteStatusChangesModelOutputColumnsAndTypes + + [(col + noteStatusChangesPrev, t) for (col, t) in noteStatusChangesModelOutputColumnsAndTypes] +) + +noteStatusChangeTSVColumnsAndTypes = noteStatusChangesDerivedColumnsAndTypes + sorted( + noteStatusChangesModelOutputWithPreviousColumnsAndTypes, key=lambda tup: tup[0] +) +noteStatusChangesTSVColumns = [col for (col, dtype) in noteStatusChangeTSVColumnsAndTypes] +noteStatusChangesTSVTypeMapping = { + col: dtype for (col, dtype) in 
noteStatusChangeTSVColumnsAndTypes +} + +datasetKeyKey = "datasetKey" +partitionToReadKey = "partitionToRead" +fileNameToReadKey = "fileNameToRead" +inputPathsTSVColumnsAndTypes = [ + (datasetKeyKey, str), + (partitionToReadKey, str), + (fileNameToReadKey, str), +] +inputPathsTSVColumns = [col for (col, _) in inputPathsTSVColumnsAndTypes] +inputPathsTSVTypeMapping = {col: dtype for (col, dtype) in inputPathsTSVColumnsAndTypes} + @contextmanager def time_block(label): @@ -638,16 +835,36 @@ def time_block(label): yield finally: end = time.time() - print(f"{label} elapsed time: {end - start:.2f} secs ({((end-start)/60.0):.2f} mins)") + logger.info(f"{label} elapsed time: {end - start:.2f} secs ({((end - start) / 60.0):.2f} mins)") + + +### TODO: weave through second round intercept. +@dataclass +class ReputationGlobalIntercept: + firstRound: float + secondRound: float + finalRound: float + + +@dataclass +class PrescoringMetaScorerOutput: + globalIntercept: Optional[float] + lowDiligenceGlobalIntercept: Optional[ReputationGlobalIntercept] + tagFilteringThresholds: Optional[Dict[str, float]] # tag => threshold + finalRoundNumRatings: Optional[int] + finalRoundNumNotes: Optional[int] + finalRoundNumUsers: Optional[int] + + +@dataclass +class PrescoringMetaOutput: + metaScorerOutput: Dict[str, PrescoringMetaScorerOutput] # scorerName => output @dataclass class SharedMemoryDataframeInfo: sharedMemoryName: str - columns: list - dataShape: tuple - dtypesDict: dict - npDtype: str + dataSize: int @dataclass @@ -692,6 +909,7 @@ class PrescoringArgs(ScoringArgs): class FinalScoringArgs(ScoringArgs): prescoringNoteModelOutput: pd.DataFrame prescoringRaterModelOutput: pd.DataFrame + prescoringMetaOutput: PrescoringMetaOutput def remove_large_args_for_multiprocessing(self): self.ratings = None @@ -707,3 +925,22 @@ class ModelResult: helpfulnessScores: pd.DataFrame auxiliaryNoteInfo: pd.DataFrame scorerName: Optional[str] + metaScores: Optional[PrescoringMetaScorerOutput] + + +class RescoringRuleID(Enum): + ALL_NOTES = 1 + NOTES_WITH_NEW_RATINGS = 2 + NOTES_FLIPPED_PREVIOUS_RUN = 3 + NEW_NOTES_NOT_RESCORED_RECENTLY_ENOUGH = 4 + RECENTLY_FLIPPED_NOTES_NOT_RESCORED_RECENTLY_ENOUGH = 5 + NMR_DUE_TO_MIN_STABLE_CRH_TIME = 6 + NOTES_CREATED_SOMEWHAT_RECENTLY = 7 + + +@dataclass +class NoteSubset: + noteSet: Optional[set] + maxNewCrhChurnRate: float + maxOldCrhChurnRate: float + description: RescoringRuleID diff --git a/sourcecode/scoring/contributor_state.py b/sourcecode/scoring/contributor_state.py index 5cbf57ef..be8e2841 100644 --- a/sourcecode/scoring/contributor_state.py +++ b/sourcecode/scoring/contributor_state.py @@ -1,3 +1,5 @@ +import logging + from . import constants as c, explanation_tags from .helpfulness_scores import author_helpfulness from .note_ratings import get_ratings_with_scores, get_valid_ratings @@ -5,6 +7,10 @@ import pandas as pd +logger = logging.getLogger("birdwatch.contributor_state") +logger.setLevel(logging.INFO) + + def should_earn_in(contributorScoresWithEnrollment: pd.DataFrame): """ The participant should earn in when they are in the earnedOutAcknowledged, earnedoutNoAck and newUser state. 
@@ -17,7 +23,8 @@ def should_earn_in(contributorScoresWithEnrollment: pd.DataFrame): authorEnrollmentCounts (pd.DataFrame): Scored Notes + User Enrollment status """ return ( - (contributorScoresWithEnrollment[c.enrollmentState] != c.earnedIn) + (contributorScoresWithEnrollment[c.enrollmentState] != c.removed) + & (contributorScoresWithEnrollment[c.enrollmentState] != c.earnedIn) & (contributorScoresWithEnrollment[c.enrollmentState] != c.atRisk) & ( contributorScoresWithEnrollment[c.ratingImpact] @@ -36,7 +43,8 @@ def newly_at_risk(authorEnrollmentCounts: pd.DataFrame): authorEnrollmentCounts (pd.DataFrame): Scored Notes + User Enrollment status """ return ( - (authorEnrollmentCounts[c.enrollmentState] != c.newUser) + (authorEnrollmentCounts[c.enrollmentState] != c.removed) + & (authorEnrollmentCounts[c.enrollmentState] != c.newUser) & (authorEnrollmentCounts[c.enrollmentState] != c.earnedOutNoAcknowledge) & (authorEnrollmentCounts[c.enrollmentState] != c.earnedOutAcknowledged) & (authorEnrollmentCounts[c.enrollmentState] != c.atRisk) @@ -55,7 +63,8 @@ def is_earned_out(authorEnrollmentCounts: pd.DataFrame): authorEnrollmentCounts (pd.DataFrame): Scored Notes + User Enrollment status """ return ( - (authorEnrollmentCounts[c.enrollmentState] != c.newUser) + (authorEnrollmentCounts[c.enrollmentState] != c.removed) + & (authorEnrollmentCounts[c.enrollmentState] != c.newUser) & (authorEnrollmentCounts[c.enrollmentState] != c.earnedOutAcknowledged) & (authorEnrollmentCounts[c.notesCurrentlyRatedNotHelpful] > c.isAtRiskCRNHCount) ) @@ -71,7 +80,8 @@ def newly_earned_in(authorEnrollmentCounts): authorEnrollmentCounts (pd.DataFrame): Scored Notes + User Enrollment status """ return ( - (authorEnrollmentCounts[c.enrollmentState] != c.newUser) + (authorEnrollmentCounts[c.enrollmentState] != c.removed) + & (authorEnrollmentCounts[c.enrollmentState] != c.newUser) & (authorEnrollmentCounts[c.enrollmentState] != c.earnedOutAcknowledged) & (authorEnrollmentCounts[c.enrollmentState] != c.earnedOutNoAcknowledge) & (authorEnrollmentCounts[c.enrollmentState] != c.earnedIn) @@ -120,21 +130,21 @@ def _get_rated_after_decision( assert ( len(ratingInfos) == len(ratings) ), f"assigning a status timestamp shouldn't decrease number of ratings: {len(ratingInfos)} vs. 
{len(ratings)}" - print("Calculating ratedAfterDecision:") - print(f" Total ratings: {len(ratingInfos)}") + logger.info("Calculating ratedAfterDecision:") + logger.info(f" Total ratings: {len(ratingInfos)}") ratingInfos = ratingInfos[~pd.isna(ratingInfos[c.timestampMillisOfNoteMostRecentNonNMRLabelKey])] - print(f" Total ratings on notes with status: {len(ratingInfos)}") + logger.info(f" Total ratings on notes with status: {len(ratingInfos)}") ratingInfos = ratingInfos[ ratingInfos[c.createdAtMillisKey] > ratingInfos[c.timestampMillisOfNoteMostRecentNonNMRLabelKey] ] - print(f" Total ratings after status: {len(ratingInfos)}") + logger.info(f" Total ratings after status: {len(ratingInfos)}") ratingInfos[c.ratedAfterDecision] = 1 ratedAfterDecision = ( ratingInfos[[c.raterParticipantIdKey, c.ratedAfterDecision]] .groupby(c.raterParticipantIdKey) .sum() ) - print(f" Total raters rating after decision: {len(ratedAfterDecision)}") + logger.info(f" Total raters rating after decision: {len(ratedAfterDecision)}") return ratedAfterDecision @@ -165,17 +175,25 @@ def _get_visible_rating_counts( ratingCounts = validRatings.groupby(c.raterParticipantIdKey).sum()[ratingCountRows] ratingsWithScores = get_ratings_with_scores(ratings, noteStatusHistory, scoredNotes) + historyCounts = ratingsWithScores.groupby(c.raterParticipantIdKey).sum()[ [c.awaitingMoreRatingsBoolKey] ] historyCounts[c.ratingsAwaitingMoreRatings] = historyCounts[c.awaitingMoreRatingsBoolKey] ratedAfterDecision = _get_rated_after_decision(ratings, noteStatusHistory) - historyCounts = historyCounts.merge(ratedAfterDecision, on=c.raterParticipantIdKey, how="left") + historyCounts = historyCounts.merge( + ratedAfterDecision, + on=c.raterParticipantIdKey, + how="left", + unsafeAllowed=c.ratedAfterDecision, + ) # Fill in zero for any rater who didn't rate any notes after status was assigned and consequently # doesn't appear in the dataframe. 
historyCounts = historyCounts.fillna({c.ratedAfterDecision: 0}) - ratingCounts = ratingCounts.merge(historyCounts, on=c.raterParticipantIdKey, how="outer") + ratingCounts = ratingCounts.merge( + historyCounts, on=c.raterParticipantIdKey, how="outer", unsafeAllowed=set(ratingCountRows) + ) for rowName in ratingCountRows: ratingCounts[rowName] = ratingCounts[rowName].fillna(0) return ratingCounts @@ -310,7 +328,7 @@ def is_emerging_writer(scoredNotes: pd.DataFrame): """ authorCounts = author_helpfulness(scoredNotes, c.coreNoteInterceptKey) raterCounts = scoredNotes.groupby(c.noteAuthorParticipantIdKey).sum(numeric_only=True)[ - c.numRatingsLast28DaysKey + [c.numRatingsLast28DaysKey] ] emergingWriter = ( authorCounts.join(raterCounts, how="outer", lsuffix="_author", rsuffix="_rater") @@ -349,6 +367,7 @@ def single_trigger_earn_out(contributorScoresWithEnrollment: pd.DataFrame) -> pd != c.enrollmentStateToThrift[c.earnedOutAcknowledged] ) & (contributorScoresWithEnrollment[c.enrollmentState] != c.enrollmentStateToThrift[c.newUser]) + & (contributorScoresWithEnrollment[c.enrollmentState] != c.enrollmentStateToThrift[c.removed]) ) contributorScoresWithEnrollment.loc[earnedOutUsers, c.numberOfTimesEarnedOutKey] = ( @@ -408,19 +427,19 @@ def get_contributor_state( ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, userEnrollment: pd.DataFrame, - logging: bool = True, + log: bool = True, ) -> pd.DataFrame: """ Given scored notes, ratings, note status history, the current user enrollment state, this - uses the contributor counts over ratings and notes and transitions the user between the different - enrollment states. + uses the contributor counts over ratings and notes and transitions the user between the + different enrollment states. If current user enrollment state is removed, do not change. Args: scoredNotes (pd.DataFrame): scored notes ratings (pd.DataFrame): all ratings noteStatusHistory (pd.DataFrame): history of note statuses userEnrollment (pd.DataFrame): User enrollment for BW participants. - logging (bool): Should we log + log (bool): Should we log Returns: pd.DataFrame: contributorScoresWithEnrollment The contributor scores with enrollments """ @@ -434,7 +453,11 @@ def get_contributor_state( # We need to consider only the last 5 notes for enrollment state. The ratings are aggregated historically. 
# For users who have earned out, we should only consider notes written since the earn out event scoredNotesWithLastEarnOut = scoredNotes.merge( - userEnrollment, left_on=c.noteAuthorParticipantIdKey, right_on=c.participantIdKey, how="left" + userEnrollment[[c.participantIdKey, c.timestampOfLastEarnOut]], + left_on=c.noteAuthorParticipantIdKey, + right_on=c.participantIdKey, + how="left", + unsafeAllowed=c.timestampOfLastEarnOut, ) # For users who don't appear in the userEnrollment file, set their timeStampOfLastEarnOut to default scoredNotesWithLastEarnOut[c.timestampOfLastEarnOut].fillna(1, inplace=True) @@ -462,6 +485,7 @@ def get_contributor_state( left_on=c.raterParticipantIdKey, right_on=c.noteAuthorParticipantIdKey, how="outer", + unsafeAllowed=c.hasCrnhSinceEarnOut, ).drop(columns=[c.noteAuthorParticipantIdKey]) with c.time_block("Contributor State: Emerging Writers"): @@ -472,12 +496,23 @@ def get_contributor_state( left_on=c.raterParticipantIdKey, right_on=c.noteAuthorParticipantIdKey, how="outer", + unsafeAllowed=c.isEmergingWriterKey, ).drop(columns=[c.noteAuthorParticipantIdKey]) with c.time_block("Contributor State: Combining"): # We merge the current enrollment state contributorScoresWithEnrollment = contributorScores.merge( - userEnrollment, left_on=c.raterParticipantIdKey, right_on=c.participantIdKey, how="outer" + userEnrollment, + left_on=c.raterParticipantIdKey, + right_on=c.participantIdKey, + how="outer", + unsafeAllowed={ + c.successfulRatingNeededToEarnIn, + c.timestampOfLastStateChange, + c.numberOfTimesEarnedOutKey, + "coreBool", + "expansionBool", + }, ) # We set the new contributor state. @@ -553,27 +588,22 @@ def get_contributor_state( # users that do not have an id. contributorScoresWithEnrollment.dropna(subset=[c.raterParticipantIdKey], inplace=True) - if logging: - print("Enrollment State") - print( - "Number of Earned In", - len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 0]), + if log: + logger.info("Enrollment State") + logger.info( + f"Number of Earned In {len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 0])}" ) - print( - "Number At Risk", - len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 1]), + logger.info( + f"Number At Risk {len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 1])}" ) - print( - "Number of Earn Out No Ack", - len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 2]), + logger.info( + f"Number of Earn Out No Ack {len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 2])}" ) - print( - "Number of Earned Out Ack", - len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 3]), + logger.info( + f"Number of Earned Out Ack {len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 3])}" ) - print( - "Number of New Users", - len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 4]), + logger.info( + f"Number of New Users {len(contributorScoresWithEnrollment[contributorScoresWithEnrollment[c.enrollmentState] == 4])}" ) return contributorScoresWithEnrollment, mappedUserEnrollment @@ -586,7 +616,7 @@ def get_contributor_scores( lastNNotes=-1, countNMRNotesLast: bool = False, sinceLastEarnOut: bool = False, - logging: bool = True, + log: bool = True, ) -> pd.DataFrame: """ Given the outputs of the MF model, this 
function aggregates stats over notes and ratings. The @@ -599,7 +629,7 @@ def get_contributor_scores( lastNNotes (int): count over the last n notes countNMRNotesLast (bool): count NMR notes last. Useful when you want to calculate over a limited set of CRH + CRNH notes sinceLastEarnOut: only count notes since last Earn Out event - logging (bool): Should we log? + log (bool): Should we log? Returns: pd.DataFrame: contributorScores - rating + note aggregates per contributor. """ @@ -608,7 +638,25 @@ def get_contributor_scores( scoredNotes, lastNNotes, countNMRNotesLast, sinceLastEarnOut ) contributorCounts = ( - visibleRatingCounts.join(visibleNoteCounts, lsuffix="note", rsuffix="rater", how="outer") + visibleRatingCounts.join( + visibleNoteCounts, + lsuffix="note", + rsuffix="rater", + how="outer", + unsafeAllowed={ + c.defaultIndexKey, + c.awaitingMoreRatingsBoolKey + "note", + c.ratingsAwaitingMoreRatings, + c.currentlyRatedHelpfulBoolKey, + c.currentlyRatedNotHelpfulBoolKey, + c.awaitingMoreRatingsBoolKey + "rater", + c.notesCurrentlyRatedHelpful, + c.notesCurrentlyRatedNotHelpful, + c.notesAwaitingMoreRatings, + c.numRatingsKey, + c.aggregateRatingReceivedTotal, + }, + ) .reset_index() .rename({"index": c.raterParticipantIdKey}, axis=1)[ [ @@ -629,7 +677,7 @@ def get_contributor_scores( ] ) - if logging: - print("Number Contributor Counts: ", len(contributorCounts)) + if log: + logger.info(f"Number Contributor Counts: {len(contributorCounts)}") return contributorCounts diff --git a/sourcecode/scoring/enums.py b/sourcecode/scoring/enums.py index cff342f0..ea1cf6cc 100644 --- a/sourcecode/scoring/enums.py +++ b/sourcecode/scoring/enums.py @@ -13,6 +13,7 @@ class Scorers(Enum): MFExpansionPlusScorer = auto() ReputationScorer = auto() MFTopicScorer = auto() + MFMultiGroupScorer = auto() class Topics(Enum): diff --git a/sourcecode/scoring/explanation_tags.py b/sourcecode/scoring/explanation_tags.py index 6d5c95a8..06ac4d31 100644 --- a/sourcecode/scoring/explanation_tags.py +++ b/sourcecode/scoring/explanation_tags.py @@ -1,5 +1,5 @@ from collections import Counter -from typing import List, Optional +from typing import List from . import constants as c @@ -7,46 +7,66 @@ import pandas as pd -def top_tags( - row: pd.Series, +def get_top_two_tags_for_note( + noteStats: pd.DataFrame, minRatingsToGetTag: int, - minTagsNeededForStatus: int, - tagsConsidered: Optional[List[str]] = None, -) -> pd.Series: - """Given a particular row of the scoredNotes DataFrame, determine which two - explanation tags to assign to the note based on its ratings. + minTagsNeededToGetStatus: int, + tagsConsideredInTiebreakOrder: List[str], +) -> pd.DataFrame: + """Given a scoredNotes DataFrame, determine which two + explanation tags to assign to each note based on its ratings. See https://twitter.github.io/communitynotes/ranking-notes/#determining-note-status-explanation-tags Args: - row (pd.Series): row of the scoredNotes dataframe, including a count of each tag + noteStats (pd.DataFrame): row of the scoredNotes dataframe, including a count of each tag minRatingsToGetTag (int): min ratings needed minTagsNeededForStatus (int): min tags needed before a note gets a status - tagsConsidered (list[str]): set of tags to consider for *all* notes + tagsConsideredInTiebreakOrder (list[str]): set of tags to consider for *all* notes Returns: - Tuple: return the whole row back, with rating tag fields filled in. 
+ A dataframe back, filtered to just the rows that are assigned tags, with + c.firstTagKey and c.secondTagKey columns set, and the index set to noteId. """ - if tagsConsidered: - tagCounts = pd.DataFrame(row[tagsConsidered]) - elif row[c.finalRatingStatusKey] == c.currentlyRatedHelpful: - tagCounts = pd.DataFrame(row[c.helpfulTagsTiebreakOrder]) - elif row[c.finalRatingStatusKey] == c.currentlyRatedNotHelpful: - tagCounts = pd.DataFrame(row[c.notHelpfulTagsTiebreakOrder]) - else: - return row - tagCounts.columns = [c.tagCountsKey] - tagCounts[c.tiebreakOrderKey] = range(len(tagCounts)) - tagCounts = tagCounts[tagCounts[c.tagCountsKey] >= minRatingsToGetTag] - topTags = tagCounts.sort_values(by=[c.tagCountsKey, c.tiebreakOrderKey], ascending=False)[:2] - # Note: this currently only allows for minTagsNeededForStatus between 0-2 - if len(topTags) >= minTagsNeededForStatus: - if len(topTags): - row[c.firstTagKey] = topTags.index[0] - if len(topTags) > 1: - row[c.secondTagKey] = topTags.index[1] + assert ( + minTagsNeededToGetStatus == 2 + ), f"minTagsNeededToGetStatus was {minTagsNeededToGetStatus} but only implemented for minTagsNeededToGetStatus=2" + + with c.time_block("NH Tags: Top 2 per note"): + noteStats.set_index(c.noteIdKey, inplace=True) + noteTagTotals = noteStats[tagsConsideredInTiebreakOrder[::-1]] # Put winning tags at front. - return row + # Filter tags and apply minimum rating threshold + filteredTags = noteTagTotals.where(lambda x: x >= minRatingsToGetTag) + filteredTags.dropna( + thresh=minTagsNeededToGetStatus, inplace=True + ) # only keep rows with at least 2 non-NaN entries + + negativeTags = -1 * filteredTags.to_numpy(dtype=np.float64) + + # Create a small value for tie-breaking, proportional to the column indices + # The small value should be smaller than the smallest difference between any two elements (ints) + epsilon = 1e-3 + tieBreakers = np.arange(negativeTags.shape[1]) * epsilon + + # Add the tie_breaker to the array + negativeTieBrokenTags = tieBreakers + negativeTags + + # Fill nans with 0 (higher than all other values with nonzero tag counts) + negativeTieBrokenTags = np.nan_to_num(negativeTieBrokenTags) + + # Use argsort on the modified array + sortedIndices = np.argsort(negativeTieBrokenTags, axis=1) + + # Extract indices of the two largest values in each row + topTwoIndices = sortedIndices[:, :2] + noteTopTags = pd.DataFrame( + np.array(filteredTags.columns)[topTwoIndices], columns=[c.firstTagKey, c.secondTagKey] + ) + noteTopTags[c.noteIdKey] = filteredTags.index + noteTopTags.index = filteredTags.index + + return noteTopTags def get_top_nonhelpful_tags_per_author( @@ -78,7 +98,7 @@ def get_top_nonhelpful_tags_per_author( filteredTags = noteTagTotals.where(lambda x: x >= c.minRatingsToGetTag) filteredTags.dropna(thresh=2, inplace=True) # only keep rows with at least 2 non-NaN entries - negativeTags = -1 * filteredTags.to_numpy() + negativeTags = -1 * filteredTags.to_numpy(dtype=np.float64) # Create a small value for tie-breaking, proportional to the column indices # The small value should be smaller than the smallest difference between any two elements (ints) @@ -88,13 +108,16 @@ def get_top_nonhelpful_tags_per_author( # Add the tie_breaker to the array negativeTieBrokenTags = tieBreakers + negativeTags + # Fill nans with 0 (higher than all other values with nonzero tag counts) + negativeTieBrokenTags = np.nan_to_num(negativeTieBrokenTags) + # Use argsort on the modified array sortedIndices = np.argsort(negativeTieBrokenTags, axis=1) # Extract indices of the 
two largest values in each row topTwoIndices = sortedIndices[:, :2] noteTopTags = pd.DataFrame( - np.array(filteredTags.columns)[topTwoIndices], columns=["firstTag", "secondTag"] + np.array(filteredTags.columns)[topTwoIndices], columns=[c.firstTagKey, c.secondTagKey] ) noteTopTags[c.noteIdKey] = filteredTags.index diff --git a/sourcecode/scoring/helpfulness_scores.py b/sourcecode/scoring/helpfulness_scores.py index f74b884e..d909a818 100644 --- a/sourcecode/scoring/helpfulness_scores.py +++ b/sourcecode/scoring/helpfulness_scores.py @@ -1,3 +1,4 @@ +import logging from typing import Optional from . import constants as c @@ -6,6 +7,10 @@ import pandas as pd +logger = logging.getLogger("birdwatch.helpfulness_scores") +logger.setLevel(logging.INFO) + + def author_helpfulness( scoredNotes: pd.DataFrame, noteInterceptKey: str, @@ -101,7 +106,26 @@ def compute_general_helpfulness_scores( raterCounts = _rater_helpfulness(validRatings) helpfulnessScores = ( - authorCounts.join(raterCounts, how="outer", lsuffix="_author", rsuffix="_rater") + authorCounts.join( + raterCounts, + how="outer", + lsuffix="_author", + rsuffix="_rater", + unsafeAllowed={ + c.defaultIndexKey, + c.currentlyRatedHelpfulBoolKey, + c.currentlyRatedNotHelpfulBoolKey, + c.noteCountKey, + c.ratingAgreesWithNoteStatusKey, + # ratingCountKey was added with the migration to Pandas 2.2.2 because type checking showed + # a new conversion from int64 to float64. Given the outer join and the data involved, that + # type conversion is actually expected. Additionally, we already have an exception for an + # int64 to float64 type conversion for ratingAgreesWithNoteStatusKey, which suggests the only + # reason we didn't see warnings for ratingCountKey before was that the type may have already + # been float64 going into the join. + c.ratingCountKey, + }, + ) .reset_index() .rename({"index": c.raterParticipantIdKey}, axis=1)[ [ @@ -135,12 +159,16 @@ def compute_general_helpfulness_scores( ) helpfulRatingsOnBadNotesCount = ( - helpfulRatingsOnBadNotes.groupby(c.raterParticipantIdKey) - .sum()[[c.totalHelpfulHarassmentRatingsPenaltyKey]] + helpfulRatingsOnBadNotes[[c.raterParticipantIdKey, c.totalHelpfulHarassmentRatingsPenaltyKey]] + .groupby(c.raterParticipantIdKey)[[c.totalHelpfulHarassmentRatingsPenaltyKey]] + .sum() + .reset_index() ) helpfulnessScores = helpfulnessScores.merge( - helpfulRatingsOnBadNotesCount, on=c.raterParticipantIdKey, how="left" + helpfulRatingsOnBadNotesCount, + on=c.raterParticipantIdKey, + how="left", + unsafeAllowed=c.totalHelpfulHarassmentRatingsPenaltyKey, ) helpfulnessScores[c.totalHelpfulHarassmentRatingsPenaltyKey].fillna(0, inplace=True) @@ -176,7 +204,7 @@ def compute_general_helpfulness_scores( def filter_ratings_by_helpfulness_scores( ratingsForTraining: pd.DataFrame, helpfulnessScores: pd.DataFrame, - logging: bool = True, + log: bool = True, ): """Filter out ratings from raters whose helpfulness scores are too low. See https://twitter.github.io/communitynotes/contributor-scores/#filtering-ratings-based-on-helpfulness-scores. @@ -184,7 +212,7 @@ def filter_ratings_by_helpfulness_scores( Args: ratingsForTraining pandas.DataFrame: unfiltered input ratings helpfulnessScores pandas.DataFrame: helpfulness scores to use to determine which raters to filter out. - logging (bool, optional): debug output. Defaults to True. + log (bool, optional): debug output. Defaults to True. Returns: filtered_ratings pandas.DataFrame: same schema as input ratings, but filtered.
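Reviewer note: the new `get_top_two_tags_for_note` above replaces the per-row `top_tags` apply with a single vectorized pass. A minimal standalone sketch of the same argsort-with-epsilon tie-break, using a hypothetical `top_two_tags` helper and literal column names rather than the real `scoredNotes` frame and `c.*` constants:

```python
import numpy as np
import pandas as pd

def top_two_tags(tagCounts: pd.DataFrame, minRatingsToGetTag: int = 2) -> pd.DataFrame:
  # Keep only tags that reached the minimum count, and only notes with >= 2 such tags.
  filtered = tagCounts.where(lambda x: x >= minRatingsToGetTag).dropna(thresh=2)
  # Negate so that larger counts sort first under ascending argsort.
  negative = -1 * filtered.to_numpy(dtype=np.float64)
  # Epsilon is smaller than any gap between integer counts, so it only breaks exact ties,
  # favoring columns that appear earlier in the frame.
  negative = negative + np.arange(negative.shape[1]) * 1e-3
  # NaN -> 0, which is larger than every real (negative) entry, so missing tags never win.
  negative = np.nan_to_num(negative)
  topTwo = np.argsort(negative, axis=1)[:, :2]
  return pd.DataFrame(
    np.array(filtered.columns)[topTwo],
    columns=["firstTag", "secondTag"],
    index=filtered.index,
  )
```

Because the epsilon is far below 1, it never reorders tags with different integer counts; it only decides which of two equally common tags is reported first.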
@@ -196,15 +224,14 @@ def filter_ratings_by_helpfulness_scores( ratingsForTraining, on=c.raterParticipantIdKey ) - if logging: - print("Unique Raters: ", len(np.unique(ratingsForTraining[c.raterParticipantIdKey]))) - print("People (Authors or Raters) With Helpfulness Scores: ", len(helpfulnessScores)) - print("Raters Included Based on Helpfulness Scores: ", len(includedUsers)) - print( - "Included Raters who have rated at least 1 note in the final dataset: ", - len(np.unique(ratingsHelpfulnessScoreFiltered[c.raterParticipantIdKey])), + if log: + logger.info(f"Unique Raters: {len(np.unique(ratingsForTraining[c.raterParticipantIdKey]))}") + logger.info(f"People (Authors or Raters) With Helpfulness Scores: {len(helpfulnessScores)}") + logger.info(f"Raters Included Based on Helpfulness Scores: {len(includedUsers)}") + logger.info( + f"Included Raters who have rated at least 1 note in the final dataset: {len(np.unique(ratingsHelpfulnessScoreFiltered[c.raterParticipantIdKey]))}", ) - print("Number of Ratings Used For 1st Training: ", len(ratingsForTraining)) - print("Number of Ratings for Final Training: ", len(ratingsHelpfulnessScoreFiltered)) + logger.info(f"Number of Ratings Used For 1st Training: {len(ratingsForTraining)}") + logger.info(f"Number of Ratings for Final Training: {len(ratingsHelpfulnessScoreFiltered)}") return ratingsHelpfulnessScoreFiltered diff --git a/sourcecode/scoring/incorrect_filter.py b/sourcecode/scoring/incorrect_filter.py index 2a044185..a91ae6fa 100644 --- a/sourcecode/scoring/incorrect_filter.py +++ b/sourcecode/scoring/incorrect_filter.py @@ -8,49 +8,60 @@ import pandas as pd -def _get_user_incorrect_ratio(nhTagRatings: pd.DataFrame) -> pd.DataFrame: +def get_user_incorrect_ratio(ratings: pd.DataFrame) -> pd.DataFrame: """Computes empirical p(incorrect | not helpful tags assigned) per rater. + Called during prescoring only, since it uses entire rating history. Args: - nhTagRatings: DF containing all ratings with some NH tag + ratings: DF containing ratings. Returns: pd.DataFrame containing one row per user who assigned not helpful tags with their empirical propensity to assign "incorrect" tag """ + # Filter down to just ratings with some nh tags used. + nhTagRatings = ratings.loc[ratings[c.notHelpfulTagsTSVOrder].sum(axis=1) > 0] user_incorrect = ( - nhTagRatings[[c.raterParticipantIdKey, c.notHelpfulIncorrectTagKey]] - .groupby(c.raterParticipantIdKey) - .agg("sum") + ( + nhTagRatings[[c.raterParticipantIdKey, c.notHelpfulIncorrectTagKey]] + .groupby(c.raterParticipantIdKey) + .agg("sum") + ) + .rename(columns={c.notHelpfulIncorrectTagKey: c.incorrectTagRatingsMadeByRaterKey}) + .reset_index() ) + user_nh_rating_count = ( - nhTagRatings[[c.raterParticipantIdKey, c.noteIdKey]] - .groupby(c.raterParticipantIdKey) - .agg("count") + ( + nhTagRatings[[c.raterParticipantIdKey, c.noteIdKey]] + .groupby(c.raterParticipantIdKey) + .agg("count") + ) + .rename(columns={c.noteIdKey: c.totalRatingsMadeByRaterKey}) + .reset_index() ) - user_nh_rating_count.rename(columns={c.noteIdKey: "cnt"}, inplace=True) + user_totals = user_incorrect.merge(user_nh_rating_count, on=c.raterParticipantIdKey) return user_totals def _get_incorrect_tfidf_ratio( - augmented_ratings: pd.DataFrame, user_filter: Optional[bool], suffix: str + augmented_ratings: pd.DataFrame, interval_filter: Optional[bool] ) -> pd.DataFrame: """Computes empirical p(incorrect | note) / p(incorrect | raters over all notes) subject to rater-note inclusion function. 
Args: augmented_ratings: ratings DF with note and rater factors and user incorrect TF - filter: inclusion criteria for "incorrect" voters - suffix: suffix for incorrect and count column names for this filter + interval_filter: inclusion criteria: only keep the ratings where the rater and note factors are within a certain interval Returns: pd.DataFrame with one row for each note, with computed sum(tf_idf_incorrect) score for raters included in filter """ - if user_filter is not None: - ratings_w_user_totals = augmented_ratings[user_filter] + if interval_filter is not None: + ratings_w_user_totals = augmented_ratings[interval_filter] else: ratings_w_user_totals = augmented_ratings @@ -60,7 +71,7 @@ def _get_incorrect_tfidf_ratio( .agg("count") .reset_index() ) - note_nh_count.rename(columns={c.raterParticipantIdKey: "num_voters"}, inplace=True) + note_nh_count.rename(columns={c.raterParticipantIdKey: c.numVotersKey}, inplace=True) columns_to_attempt_to_drop = [ c.internalRaterFactor1Key, @@ -70,34 +81,61 @@ def _get_incorrect_tfidf_ratio( columns_to_drop = ratings_w_user_totals.columns.intersection(columns_to_attempt_to_drop) ratings_w_user_totals.drop(columns_to_drop, inplace=True, axis=1) - ratings_w_user_totals["p_incorrect_user"] = ( - ratings_w_user_totals["notHelpfulIncorrect_total"] / ratings_w_user_totals["cnt"] + ratings_w_user_totals[c.incorrectTagRateByRaterKey] = ( + ratings_w_user_totals[c.incorrectTagRatingsMadeByRaterKey] + / ratings_w_user_totals[c.totalRatingsMadeByRaterKey] ) + # Setup columns to be aggregated so they are not dropped during aggregation + ratings_w_user_totals[c.incorrectTagRateByRaterKey].fillna(0, inplace=True) + ratings_w_user_totals[c.incorrectTagRateByRaterKey] = ratings_w_user_totals[ + c.incorrectTagRateByRaterKey + ].astype(np.double) + ratings_w_user_totals[c.incorrectTagRatingsMadeByRaterKey].fillna(0, inplace=True) + ratings_w_user_totals[c.incorrectTagRatingsMadeByRaterKey] = ratings_w_user_totals[ + c.incorrectTagRatingsMadeByRaterKey + ].astype(np.double) + ratings_w_user_totals[c.totalRatingsMadeByRaterKey].fillna(0, inplace=True) + ratings_w_user_totals[c.totalRatingsMadeByRaterKey] = ratings_w_user_totals[ + c.totalRatingsMadeByRaterKey + ].astype(np.double) + rating_aggs = ratings_w_user_totals.groupby(c.noteIdKey).agg("sum").reset_index() rating_aggs_w_cnt = rating_aggs.merge(note_nh_count, on=c.noteIdKey) - rating_aggs_w_cnt["tf_idf_incorrect"] = (rating_aggs_w_cnt[c.notHelpfulIncorrectTagKey]) / np.log( - 1 + (rating_aggs_w_cnt["p_incorrect_user"]) + rating_aggs_w_cnt[c.noteTfIdfIncorrectScoreKey] = ( + rating_aggs_w_cnt[c.notHelpfulIncorrectTagKey] + ) / np.log( + 1 + (rating_aggs_w_cnt[c.incorrectTagRateByRaterKey]) ) # p(incorrect over all rater ratings) - rating_aggs_w_cnt.drop(["notHelpfulIncorrect_total", "cnt"], inplace=True, axis=1) - rating_aggs_w_cnt.columns = [c.noteIdKey] + [ - f"{col}{suffix}" for col in rating_aggs_w_cnt.columns[1:] - ] + rating_aggs_w_cnt.drop( + [c.totalRatingsMadeByRaterKey, c.incorrectTagRatingsMadeByRaterKey], inplace=True, axis=1 + ) + + rating_aggs_w_cnt.rename( + columns={ + c.notHelpfulIncorrectTagKey: c.notHelpfulIncorrectIntervalKey, + c.incorrectTagRateByRaterKey: c.sumOfIncorrectTagRateByRaterIntervalKey, + c.numVotersKey: c.numVotersIntervalKey, + c.noteTfIdfIncorrectScoreKey: c.noteTfIdfIncorrectScoreIntervalKey, + }, + inplace=True, + ) return rating_aggs_w_cnt -def get_incorrect_aggregates( +def get_incorrect_aggregates_final_scoring( ratings: pd.DataFrame, noteParams: pd.DataFrame, - 
raterParams: pd.DataFrame, + raterParamsWithRatingCounts: pd.DataFrame, ) -> pd.DataFrame: - """Computes non-helpful tag aggregates for each note. + """Computes non-helpful tag aggregates for each note. Intended to be called in final scoring. Args: ratings: initial input ratings DF containing all ratings noteParams: MF results for notes - raterParams: MF results for raters + raterParamsWithRatingCounts: MF results for raters. Should include c.incorrectTagRatingsMadeByRaterKey and c.totalRatingsMadeByRaterKey. + raterIncorrectTagRatingCounts: should contain: c.raterParticipantIdKey, Returns: pd.DataFrame containing one row per note that was scored during MF. Columns correspond to @@ -107,18 +145,24 @@ def get_incorrect_aggregates( # consider only ratings with some NH tag notHelpfulTaggedRatings = ratings.loc[ratings[c.notHelpfulTagsTSVOrder].sum(axis=1) > 0] - # get per user incorrect term frequency - user_totals = _get_user_incorrect_ratio(notHelpfulTaggedRatings) - # add user and note factors + # join user totals, note factors, and rater factors with each rating ratings_w_user_totals = ( notHelpfulTaggedRatings[[c.raterParticipantIdKey, c.noteIdKey, c.notHelpfulIncorrectTagKey]] - .merge(user_totals, on=c.raterParticipantIdKey, suffixes=(None, "_total")) .merge(noteParams[[c.noteIdKey, c.internalNoteFactor1Key]], on=c.noteIdKey) .merge( - raterParams[[c.raterParticipantIdKey, c.internalRaterFactor1Key]], on=c.raterParticipantIdKey + raterParamsWithRatingCounts[ + [ + c.raterParticipantIdKey, + c.internalRaterFactor1Key, + c.incorrectTagRatingsMadeByRaterKey, + c.totalRatingsMadeByRaterKey, + ] + ], + on=c.raterParticipantIdKey, ) ) + # Keep users with clipped factors within a certain interval of notes' (e.g. within 0.3) interval_filter = ( np.abs( ratings_w_user_totals[c.internalRaterFactor1Key].clip(-0.4, 0.4) @@ -127,7 +171,20 @@ def get_incorrect_aggregates( < c.intervalHalfWidth ) - incorrectAggregates = _get_incorrect_tfidf_ratio( - ratings_w_user_totals, interval_filter, "_interval" - ) + incorrectAggregates = _get_incorrect_tfidf_ratio(ratings_w_user_totals, interval_filter) return incorrectAggregates + + +def get_incorrect_aggregates( + ratings: pd.DataFrame, + noteParams: pd.DataFrame, + raterParams: pd.DataFrame, +) -> pd.DataFrame: + """ + Legacy version of this function, computable all at once instead of being called separately in prescoring + vs. final scoring. + """ + # get per user incorrect term frequency -- normally called during prescoring + raterParamsWithRatingCounts = raterParams.merge(get_user_incorrect_ratio(ratings)) + + return get_incorrect_aggregates_final_scoring(ratings, noteParams, raterParamsWithRatingCounts) diff --git a/sourcecode/scoring/matrix_factorization/matrix_factorization.py b/sourcecode/scoring/matrix_factorization/matrix_factorization.py index d7617f5a..cf0bde9c 100644 --- a/sourcecode/scoring/matrix_factorization/matrix_factorization.py +++ b/sourcecode/scoring/matrix_factorization/matrix_factorization.py @@ -1,4 +1,5 @@ import dataclasses +import logging from typing import List, Optional, Tuple from .. 
import constants as c @@ -10,6 +11,10 @@ import torch +logger = logging.getLogger("birdwatch.matrix_factorization") +logger.setLevel(logging.INFO) + + @dataclasses.dataclass class Constants: noteIndexKey = "noteIndex" @@ -24,8 +29,7 @@ def __init__( convergence=1e-7, numFactors=1, useGlobalIntercept=True, - logging=True, - flipFactorsForIdentification=True, + log=True, model: Optional[BiasedMatrixFactorization] = None, featureCols: Optional[List[str]] = None, labelCol: str = c.helpfulNumKey, @@ -46,8 +50,7 @@ def __init__( self._convergence = convergence self._numFactors = numFactors self._useGlobalIntercept = useGlobalIntercept - self._logging = logging - self._flipFactorsForIdentification = flipFactorsForIdentification + self._log = log self._featureCols = featureCols self._labelCol = labelCol self._useSigmoidCrossEntropy = useSigmoidCrossEntropy @@ -63,14 +66,14 @@ def __init__( if self._useSigmoidCrossEntropy: if self._posWeight: - if logging: - print(f"Using pos weight: {self._posWeight} with BCEWithLogitsLoss") + if log: + logger.info(f"Using pos weight: {self._posWeight} with BCEWithLogitsLoss") self.criterion = torch.nn.BCEWithLogitsLoss( - pos_weight=torch.Tensor(np.array(self._posWeight)), reduction="none" + pos_weight=torch.FloatTensor(np.array(self._posWeight)), reduction="none" ) else: - if logging: - print("Using BCEWithLogitsLoss") + if log: + logger.info("Using BCEWithLogitsLoss") self.criterion = torch.nn.BCEWithLogitsLoss(reduction="none") else: if self._posWeight: @@ -85,6 +88,9 @@ def __init__( self.trainModelData: Optional[ModelData] = None self.validateModelData: Optional[ModelData] = None + self._ratingPerNoteLossRatio: Optional[float] = None + self._ratingPerUserLossRatio: Optional[float] = None + def get_final_train_error(self) -> Optional[float]: return self.train_errors[-1] if self.train_errors else None @@ -95,8 +101,7 @@ def get_new_mf_with_same_args(self): convergence=self._convergence, numFactors=self._numFactors, useGlobalIntercept=self._useGlobalIntercept, - logging=self._logging, - flipFactorsForIdentification=self._flipFactorsForIdentification, + log=self._log, model=None, featureCols=self._featureCols, labelCol=self._labelCol, @@ -171,9 +176,14 @@ def _initialize_parameters( """ assert self.mf_model is not None if noteInit is not None: - if self._logging: - print("initializing notes") - noteInit = self.noteIdMap.merge(noteInit, on=c.noteIdKey, how="left") + if self._log: + logger.info("initializing notes") + noteInit = self.noteIdMap.merge( + noteInit, + on=c.noteIdKey, + how="left", + unsafeAllowed={c.noteIdKey, "noteIndex_y"}, + ) noteInit[c.internalNoteInterceptKey].fillna(0.0, inplace=True) self.mf_model.note_intercepts.weight.data = torch.tensor( @@ -189,8 +199,8 @@ def _initialize_parameters( ) if userInit is not None: - if self._logging: - print("initializing users") + if self._log: + logger.info("initializing users") userInit = self.raterIdMap.merge(userInit, on=c.raterParticipantIdKey, how="left") userInit[c.internalRaterInterceptKey].fillna(0.0, inplace=True) @@ -207,13 +217,15 @@ def _initialize_parameters( ) if globalInterceptInit is not None: - if self._logging: - print("initialized global intercept") + if self._log: + logger.info("initialized global intercept") self.mf_model.global_intercept = torch.nn.parameter.Parameter( - torch.ones(1, 1) * globalInterceptInit + torch.ones(1, 1, dtype=torch.float32) * globalInterceptInit ) - def _get_parameters_from_trained_model(self) -> Tuple[pd.DataFrame, pd.DataFrame]: + def 
_get_parameters_from_trained_model( + self, flipFactorsForIdentification: bool = True + ) -> Tuple[pd.DataFrame, pd.DataFrame]: """ Returns: Tuple[pd.DataFrame, pd.DataFrame]: noteIdMap, raterIdMap @@ -235,7 +247,7 @@ def _get_parameters_from_trained_model(self) -> Tuple[pd.DataFrame, pd.DataFrame :, i ] - if self._flipFactorsForIdentification: + if flipFactorsForIdentification: noteParams, raterParams = self._flip_factors_for_identification(noteParams, raterParams) return noteParams, raterParams @@ -261,15 +273,15 @@ def _create_mf_model( self._initialize_parameters(noteInit, userInit, globalInterceptInit) if (noteInit is not None) and (userInit is not None): - print(f"learning rate set to :{self._initLearningRate}") + logger.info(f"learning rate set to :{self._initLearningRate}") self.optimizer = torch.optim.Adam( self.mf_model.parameters(), lr=self._initLearningRate ) # smaller learning rate else: - print(f"learning rate set to :{self._noInitLearningRate}") + logger.info(f"learning rate set to :{self._noInitLearningRate}") self.optimizer = torch.optim.Adam(self.mf_model.parameters(), lr=self._noInitLearningRate) - if self._logging: - print(self.mf_model.device) + if self._log: + logger.info(f"{self.mf_model.device}") self.mf_model.to(self.mf_model.device) def _instantiate_biased_mf_model(self): @@ -280,11 +292,11 @@ def _instantiate_biased_mf_model(self): n_notes, use_global_intercept=self._useGlobalIntercept, n_factors=self._numFactors, - logging=self._logging, + log=self._log, ) - if self._logging: - print("------------------") - print(f"Users: {n_users}, Notes: {n_notes}") + if self._log: + logger.info("------------------") + logger.info(f"Users: {n_users}, Notes: {n_notes}") def _compute_and_print_loss( self, @@ -304,11 +316,11 @@ def _compute_and_print_loss( else: validate_loss_value = None - if self._logging: - print("epoch", epoch, loss_value) - print("TRAIN FIT LOSS: ", train_loss_value) + if self._log: + logger.info(f"epoch {epoch} {loss_value}") + logger.info(f"TRAIN FIT LOSS: {train_loss_value}") if validate_loss_value is not None: - print("VALIDATE FIT LOSS: ", validate_loss_value) + logger.info(f"VALIDATE FIT LOSS: {validate_loss_value}") if final == True: self.test_errors.append(loss_value) @@ -350,19 +362,60 @@ def _get_loss(self, epoch: Optional[int] = None): assert self.trainModelData is not None loss = self.criterion(y_pred, self.trainModelData.rating_labels).mean() regularizationLoss = self._get_reg_loss() - return loss + regularizationLoss + loss += regularizationLoss + assert not torch.isnan(loss).any() + return loss def _get_reg_loss(self): - l2_reg_loss = torch.tensor(0.0).to(self.mf_model.device) - l2_reg_loss += self._userFactorLambda * (self.mf_model.user_factors.weight**2).mean() - l2_reg_loss += self._noteFactorLambda * (self.mf_model.note_factors.weight**2).mean() - l2_reg_loss += self._userInterceptLambda * (self.mf_model.user_intercepts.weight**2).mean() - l2_reg_loss += self._noteInterceptLambda * (self.mf_model.note_intercepts.weight**2).mean() + l2_reg_loss = torch.tensor(0.0, dtype=torch.float32).to(self.mf_model.device) + + if self._ratingPerUserLossRatio is None: + l2_reg_loss += self._userFactorLambda * (self.mf_model.user_factors.weight**2).mean() + l2_reg_loss += self._userInterceptLambda * (self.mf_model.user_intercepts.weight**2).mean() + else: + simulatedNumberOfRatersForLoss = ( + len(self.trainModelData.rating_labels) / self._ratingPerUserLossRatio + ) + l2_reg_loss += ( + self._userFactorLambda + * 
(self.mf_model.user_factors.weight**2).sum() + / simulatedNumberOfRatersForLoss + ) + l2_reg_loss += ( + self._userInterceptLambda + * (self.mf_model.user_intercepts.weight**2).sum() + / simulatedNumberOfRatersForLoss + ) + + if self._ratingPerNoteLossRatio is None: + l2_reg_loss += self._noteFactorLambda * (self.mf_model.note_factors.weight**2).mean() + l2_reg_loss += self._noteInterceptLambda * (self.mf_model.note_intercepts.weight**2).mean() + l2_reg_loss += ( + self._diamondLambda + * (self.mf_model.note_factors.weight * self.mf_model.note_intercepts.weight).abs().mean() + ) + else: + simulatedNumberOfNotesForLoss = ( + len(self.trainModelData.rating_labels) / self._ratingPerNoteLossRatio + ) + l2_reg_loss += ( + self._noteFactorLambda + * (self.mf_model.note_factors.weight**2).sum() + / simulatedNumberOfNotesForLoss + ) + l2_reg_loss += ( + self._noteInterceptLambda + * (self.mf_model.note_intercepts.weight**2).sum() + / simulatedNumberOfNotesForLoss + ) + l2_reg_loss += ( + self._diamondLambda + * (self.mf_model.note_factors.weight * self.mf_model.note_intercepts.weight).abs().sum() + / simulatedNumberOfNotesForLoss + ) + + l2_reg_loss += self._globalInterceptLambda * (self.mf_model.global_intercept**2).mean() - l2_reg_loss += ( - self._diamondLambda - * (self.mf_model.note_factors.weight * self.mf_model.note_intercepts.weight).abs().mean() - ) + return l2_reg_loss def _fit_model( @@ -410,8 +463,8 @@ def _fit_model( epoch += 1 - if self._logging: - print("Num epochs:", epoch) + if self._log: + logger.info(f"Num epochs: {epoch}") return self._compute_and_print_loss(loss.item(), epoch, final=True) def prepare_features_and_labels( @@ -428,10 +481,10 @@ def prepare_features_and_labels( rating_labels = torch.FloatTensor(ratingFeaturesAndLabels[self._labelCol].values).to( self.mf_model.device ) - user_indexes = torch.LongTensor(ratingFeaturesAndLabels[Constants.raterIndexKey].values).to( + user_indexes = torch.IntTensor(ratingFeaturesAndLabels[Constants.raterIndexKey].values).to( self.mf_model.device ) - note_indexes = torch.LongTensor(ratingFeaturesAndLabels[Constants.noteIndexKey].values).to( + note_indexes = torch.IntTensor(ratingFeaturesAndLabels[Constants.noteIndexKey].values).to( self.mf_model.device ) self.modelData = ModelData(rating_labels, user_indexes, note_indexes) @@ -444,7 +497,12 @@ def run_mf( globalInterceptInit: Optional[float] = None, specificNoteId: Optional[int] = None, validatePercent: Optional[float] = None, + freezeNoteParameters: bool = False, freezeRaterParameters: bool = False, + freezeGlobalParameters: bool = False, + ratingPerNoteLossRatio: Optional[float] = None, + ratingPerUserLossRatio: Optional[float] = None, + flipFactorsForIdentification: bool = True, ): """Train matrix factorization model.
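Reviewer note: the `_get_reg_loss` change above switches the note and rater penalties from a `.mean()` over the entities actually present to a `.sum()` divided by a simulated entity count whenever a ratings-per-entity ratio is supplied. A minimal sketch of the note-side term, with illustrative names only (not the real class internals):

```python
from typing import Optional

import torch

def note_l2_penalty(noteFactors: torch.Tensor, numRatings: int, noteFactorLambda: float,
                    ratingPerNoteLossRatio: Optional[float] = None) -> torch.Tensor:
  if ratingPerNoteLossRatio is None:
    # Original behavior: average the penalty over the notes actually present.
    return noteFactorLambda * (noteFactors**2).mean()
  # New behavior: divide the summed penalty by a simulated note count, so regularization
  # strength tracks a target ratings-per-note density rather than the observed one.
  simulatedNumberOfNotes = numRatings / ratingPerNoteLossRatio
  return noteFactorLambda * (noteFactors**2).sum() / simulatedNumberOfNotes
```

This appears intended to keep the penalty comparable between prescoring, which sees the full rating history, and final scoring, which refits on a smaller slice of ratings.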
@@ -463,20 +521,42 @@ def run_mf( raterParams: contains one row per rating, including raterId and learned rater parameters globalIntercept: learned global intercept parameter """ + self._ratingPerNoteLossRatio = ratingPerNoteLossRatio + self._ratingPerUserLossRatio = ratingPerUserLossRatio + self._initialize_note_and_rater_id_maps(ratings) self._create_mf_model(noteInit, userInit, globalInterceptInit) assert self.mf_model is not None + logger.info( + f"Ratings per note in dataset: {len(ratings)/self.mf_model.note_factors.weight.data.shape[0]}" + ) + logger.info( + f"Ratings per user in dataset: {len(ratings)/self.mf_model.user_factors.weight.data.shape[0]}" + ) + if ratingPerNoteLossRatio is not None: + logger.info( + f"Correcting loss function to simulate rating per note loss ratio = {ratingPerNoteLossRatio}" + ) + if ratingPerUserLossRatio is not None: + logger.info( + f"Correcting loss function to simulate rating per user loss ratio = {ratingPerUserLossRatio}" + ) + if freezeRaterParameters: self.mf_model._freeze_parameters(set({"user"})) + if freezeGlobalParameters: + self.mf_model._freeze_parameters(set({"global"})) + if freezeNoteParameters: + self.mf_model._freeze_parameters(set({"note"})) if specificNoteId is not None: self.mf_model.freeze_rater_and_global_parameters() self.prepare_features_and_labels(specificNoteId) train_loss, loss, validate_loss = self._fit_model(validatePercent) if self._normalizedLossHyperparameters is not None: - _, raterParams = self._get_parameters_from_trained_model() + _, raterParams = self._get_parameters_from_trained_model(flipFactorsForIdentification) assert self.modelData is not None self._lossModule = NormalizedLoss( self.criterion, @@ -495,11 +575,13 @@ def run_mf( globalIntercept = None if self._useGlobalIntercept: - globalIntercept = self.mf_model.global_intercept - if self._logging: - print("Global Intercept: ", globalIntercept.item()) + globalIntercept = self.mf_model.global_intercept.item() + if self._log: + logger.info(f"Global Intercept: {globalIntercept}") - fitNoteParams, fitRaterParams = self._get_parameters_from_trained_model() + fitNoteParams, fitRaterParams = self._get_parameters_from_trained_model( + flipFactorsForIdentification + ) fitRaterParams.drop(Constants.raterIndexKey, axis=1, inplace=True) if validatePercent is None: diff --git a/sourcecode/scoring/matrix_factorization/model.py b/sourcecode/scoring/matrix_factorization/model.py index 49632453..b67c9ad1 100644 --- a/sourcecode/scoring/matrix_factorization/model.py +++ b/sourcecode/scoring/matrix_factorization/model.py @@ -1,14 +1,19 @@ from dataclasses import dataclass +import logging from typing import Optional import torch +logger = logging.getLogger("birdwatch.model") +logger.setLevel(logging.INFO) + + @dataclass class ModelData: rating_labels: Optional[torch.FloatTensor] - user_indexes: Optional[torch.LongTensor] - note_indexes: Optional[torch.LongTensor] + user_indexes: Optional[torch.IntTensor] + note_indexes: Optional[torch.IntTensor] class BiasedMatrixFactorization(torch.nn.Module): @@ -20,7 +25,7 @@ def __init__( n_notes: int, n_factors: int = 1, use_global_intercept: bool = True, - logging: bool = True, + log: bool = True, ) -> None: """Initialize matrix factorization model using xavier_uniform for factors and zeros for intercepts. 
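Reviewer note: `run_mf` now exposes freeze flags and the loss-ratio override used by final scoring. A hypothetical wrapper showing how they might be combined; `mf` stands for a `MatrixFactorization` instance, and the `"noteId"` column literal (the value of `c.noteIdKey`) and the argument choices are assumptions based on the calls later in this diff:

```python
import pandas as pd

def run_final_round_mf(mf, finalRoundRatings: pd.DataFrame, noteInit: pd.DataFrame,
                       userInit: pd.DataFrame, globalInterceptInit: float):
  """Sketch only: refit note parameters while rater and global parameters stay frozen."""
  return mf.run_mf(
    ratings=finalRoundRatings,
    noteInit=noteInit,
    userInit=userInit,
    globalInterceptInit=globalInterceptInit,
    freezeRaterParameters=True,   # rater embeddings keep their prescoring values
    freezeGlobalParameters=True,  # global intercept keeps its prescoring value
    # Assumes the ratings frame carries a "noteId" column (c.noteIdKey).
    ratingPerNoteLossRatio=len(finalRoundRatings) / finalRoundRatings["noteId"].nunique(),
  )
```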
@@ -33,16 +38,16 @@ def __init__( """ super().__init__() - self._logging = logging + self._log = log - self.user_factors = torch.nn.Embedding(n_users, n_factors, sparse=False) - self.note_factors = torch.nn.Embedding(n_notes, n_factors, sparse=False) + self.user_factors = torch.nn.Embedding(n_users, n_factors, sparse=False, dtype=torch.float32) + self.note_factors = torch.nn.Embedding(n_notes, n_factors, sparse=False, dtype=torch.float32) - self.user_intercepts = torch.nn.Embedding(n_users, 1, sparse=False) - self.note_intercepts = torch.nn.Embedding(n_notes, 1, sparse=False) + self.user_intercepts = torch.nn.Embedding(n_users, 1, sparse=False, dtype=torch.float32) + self.note_intercepts = torch.nn.Embedding(n_notes, 1, sparse=False, dtype=torch.float32) self.use_global_intercept = use_global_intercept - self.global_intercept = torch.nn.parameter.Parameter(torch.zeros(1, 1)) + self.global_intercept = torch.nn.parameter.Parameter(torch.zeros(1, 1, dtype=torch.float32)) torch.nn.init.xavier_uniform_(self.user_factors.weight) torch.nn.init.xavier_uniform_(self.note_factors.weight) self.user_intercepts.weight.data.fill_(0.0) @@ -81,6 +86,6 @@ def _freeze_parameters(self, words_to_freeze: set): for name, param in self.named_parameters(): for word in words_to_freeze: if word in name: - if self._logging: - print("Freezing parameter: ", name) + if self._log: + logger.info(f"Freezing parameter: {name}") param.requires_grad_(False) diff --git a/sourcecode/scoring/matrix_factorization/normalized_loss.py b/sourcecode/scoring/matrix_factorization/normalized_loss.py index af2e3ca1..ea7774b2 100644 --- a/sourcecode/scoring/matrix_factorization/normalized_loss.py +++ b/sourcecode/scoring/matrix_factorization/normalized_loss.py @@ -125,9 +125,16 @@ def __init__( # Finalize weights weightMap = dict( ((rater, note), weight) - for (rater, note, weight) in ratings[[c.raterParticipantIdKey, c.noteIdKey, "weights"]].values + for (rater, note, weight) in zip( + ratings[c.raterParticipantIdKey], ratings[c.noteIdKey], ratings["weights"] + ) + ) + self.weights = torch.FloatTensor( + [ + weightMap[(rater, note)] + for (rater, note) in zip(ratingOrder[c.raterParticipantIdKey], ratingOrder[c.noteIdKey]) + ] ) - self.weights = torch.tensor([weightMap[(rater, note)] for (rater, note) in ratingOrder.values]) assert len(self.weights) == len(self.targets) def forward(self, pred): diff --git a/sourcecode/scoring/matrix_factorization/pseudo_raters.py b/sourcecode/scoring/matrix_factorization/pseudo_raters.py index f0eada72..cea315be 100644 --- a/sourcecode/scoring/matrix_factorization/pseudo_raters.py +++ b/sourcecode/scoring/matrix_factorization/pseudo_raters.py @@ -1,12 +1,18 @@ from dataclasses import dataclass +import logging from .. 
import constants as c from .matrix_factorization import Constants as mf_c, MatrixFactorization +import numpy as np import pandas as pd import torch +logger = logging.getLogger("birdwatch.pseudo_raters") +logger.setLevel(logging.INFO) + + @dataclass class Constants: extraRaterInterceptKey = "extraRaterIntercept" @@ -27,10 +33,10 @@ def __init__( raterParams: pd.DataFrame, globalBias: float, mfRanker: MatrixFactorization, - logging=True, + log=True, checkParamsSame=True, ): - self._logging = logging + self._log = log self._mfRanker = mfRanker self._checkParamsSame = checkParamsSame self.ratings = ratings @@ -81,7 +87,9 @@ def _check_note_parameters_same(self, newMatrixFactorization: MatrixFactorizatio ( noteParamsFromNewModel, raterParamsFromNewModel, - ) = newMatrixFactorization._get_parameters_from_trained_model() + ) = newMatrixFactorization._get_parameters_from_trained_model( + flipFactorsForIdentification=False + ) assert (noteParamsFromNewModel == self.noteParams).all().all() def _make_extreme_raters(self, raterParams: pd.DataFrame, raterIdMap: pd.DataFrame): @@ -136,7 +144,8 @@ def _add_extreme_raters_to_id_maps_and_params(self): mf_c.raterIndexKey: [raterDict[mf_c.raterIndexKey]], } ), - ] + ], + unsafeAllowed=c.raterParticipantIdKey, ) if not ( @@ -152,7 +161,12 @@ def _add_extreme_raters_to_id_maps_and_params(self): c.internalRaterFactor1Key: [raterDict[c.internalRaterFactor1Key]], } ), - ] + ], + unsafeAllowed={ + c.raterParticipantIdKey, + c.internalRaterInterceptKey, + c.internalRaterFactor1Key, + }, ) def _create_new_model_with_extreme_raters_from_original_params( @@ -194,7 +208,9 @@ def _fit_all_notes_with_raters_constant(self, ratingFeaturesAndLabelsWithExtreme if self._checkParamsSame: self._check_rater_parameters_same(newExtremeMF) - fitNoteParams, fitRaterParams = newExtremeMF._get_parameters_from_trained_model() + fitNoteParams, fitRaterParams = newExtremeMF._get_parameters_from_trained_model( + flipFactorsForIdentification=False + ) return fitNoteParams def _create_extreme_ratings(self): @@ -230,6 +246,13 @@ def _create_dataset_with_extreme_rating_on_each_note(self, ratingToAddWithoutNot extremeRatingsToAdd = pd.DataFrame(ratingsWithNoteIds).drop( [c.internalRaterInterceptKey, c.internalRaterFactor1Key], axis=1 ) + extremeRatingsToAdd[c.noteIdKey] = extremeRatingsToAdd[c.noteIdKey].astype(np.int64) + if isinstance(self.ratingFeaturesAndLabels[c.raterParticipantIdKey].dtype, pd.Int64Dtype): + # Only convert ID type from string to Int64 if is necessary to match existing IDs (which is + # expected when running in prod, but not always in unit tests or public data.) 
+ extremeRatingsToAdd[c.raterParticipantIdKey] = extremeRatingsToAdd[ + c.raterParticipantIdKey + ].astype(pd.Int64Dtype()) ratingFeaturesAndLabelsWithExtremeRatings = pd.concat( [self.ratingFeaturesAndLabels, extremeRatingsToAdd] ) @@ -244,9 +267,9 @@ def _fit_note_params_for_each_dataset_with_extreme_ratings(self): self._create_dataset_with_extreme_rating_on_each_note(ratingToAddWithoutNoteId) ) - if self._logging: - print("------------------") - print(f"Re-scoring all notes with extra rating added: {ratingToAddWithoutNoteId}") + if self._log: + logger.info("------------------") + logger.info(f"Re-scoring all notes with extra rating added: {ratingToAddWithoutNoteId}") with c.time_block("Pseudo: fit all notes with raters constant"): fitNoteParams = self._fit_all_notes_with_raters_constant( @@ -264,7 +287,14 @@ def _fit_note_params_for_each_dataset_with_extreme_ratings(self): return noteParamsList def _aggregate_note_params(self, noteParamsList, joinOrig=False): - rawRescoredNotesWithEachExtraRater = pd.concat(noteParamsList) + rawRescoredNotesWithEachExtraRater = pd.concat( + noteParamsList, + unsafeAllowed={ + Constants.extraRaterInterceptKey, + Constants.extraRaterFactor1Key, + Constants.extraRatingHelpfulNumKey, + }, + ) rawRescoredNotesWithEachExtraRater.drop(mf_c.noteIndexKey, axis=1, inplace=True) rawRescoredNotesWithEachExtraRater = rawRescoredNotesWithEachExtraRater.sort_values( by=[c.noteIdKey, Constants.extraRaterInterceptKey] diff --git a/sourcecode/scoring/mf_base_scorer.py b/sourcecode/scoring/mf_base_scorer.py index bc9031b1..9e8812b8 100644 --- a/sourcecode/scoring/mf_base_scorer.py +++ b/sourcecode/scoring/mf_base_scorer.py @@ -1,9 +1,23 @@ -from typing import List, Optional, Tuple - -from . import constants as c, helpfulness_scores, note_ratings, process_data, tag_consensus +import gc +import logging +from typing import Dict, List, Optional, Set, Tuple + +from . import ( + constants as c, + helpfulness_scores, + note_ratings, + process_data, + tag_consensus, + tag_filter, +) +from .incorrect_filter import get_user_incorrect_ratio from .matrix_factorization.matrix_factorization import MatrixFactorization from .matrix_factorization.pseudo_raters import PseudoRatersRunner -from .reputation_matrix_factorization.diligence_model import get_low_diligence_intercepts +from .pandas_utils import keep_columns +from .reputation_matrix_factorization.diligence_model import ( + fit_low_diligence_model_final, + fit_low_diligence_model_prescoring, +) from .scorer import Scorer import numpy as np @@ -11,6 +25,10 @@ import torch +logger = logging.getLogger("birdwatch.mf_base_scorer") +logger.setLevel(logging.INFO) + + def coalesce_columns(df: pd.DataFrame, columnPrefix: str) -> pd.DataFrame: """Condense all columns beginning with columnPrefix into a single column. 
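Reviewer note: the pseudo-rater change above casts the synthetic extreme ratings to match the dtypes of the real ratings before `pd.concat`, so the concatenation does not silently upcast the ID columns. A minimal standalone sketch of that harmonization; the literal column names stand in for `c.noteIdKey` and `c.raterParticipantIdKey` and are assumptions here:

```python
import numpy as np
import pandas as pd

def harmonize_for_concat(realRatings: pd.DataFrame, extremeRatings: pd.DataFrame) -> pd.DataFrame:
  extremeRatings = extremeRatings.copy()
  # Note IDs are always numeric, so force a plain int64.
  extremeRatings["noteId"] = extremeRatings["noteId"].astype(np.int64)
  # Only move the rater ID to nullable Int64 when the real data already uses it (as in prod);
  # unit tests and the public data keep string IDs.
  if isinstance(realRatings["raterParticipantId"].dtype, pd.Int64Dtype):
    extremeRatings["raterParticipantId"] = extremeRatings["raterParticipantId"].astype(pd.Int64Dtype())
  return pd.concat([realRatings, extremeRatings])
```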
@@ -88,8 +106,11 @@ def get_ratings_for_stable_init( # Only include notes that have received at least 75% of their ratings from the modeling group (and 5 total) ratingsForTrainingWithModelingGroup[c.ratingCountKey] = 1 noteStatsByRatedModelingGroup = ( - ratingsForTrainingWithModelingGroup.groupby(c.noteIdKey) - .sum()[[c.ratingFromInitialModelingGroupKey, c.ratingCountKey]] + ratingsForTrainingWithModelingGroup[ + [c.noteIdKey, c.ratingFromInitialModelingGroupKey, c.ratingCountKey] + ] + .groupby(c.noteIdKey) + .sum() .reset_index() ) noteStatsByRatedModelingGroup[c.percentFromInitialModelingGroupKey] = ( @@ -129,6 +150,10 @@ class MFBaseScorer(Scorer): def __init__( self, + includedTopics: Set[str] = set(), + includedGroups: Set[int] = set(), + includeUnassigned: bool = False, + captureThreshold: Optional[float] = None, seed: Optional[int] = None, pseudoraters: Optional[bool] = True, minNumRatingsPerRater: int = 10, @@ -142,7 +167,7 @@ def __init__( crnhThresholdNoteFactorMultiplier: float = -0.8, crnhThresholdNMIntercept: float = -0.15, crnhThresholdUCBIntercept: float = -0.04, - crhSuperThreshold: float = 0.5, + crhSuperThreshold: Optional[float] = 0.5, lowDiligenceThreshold: float = 0.263, factorThreshold: float = 0.5, inertiaDelta: float = 0.01, @@ -162,10 +187,15 @@ def __init__( minimumHarassmentScoreToPenalize: float = 2.0, tagConsensusHarassmentHelpfulRatingPenalty: int = 10, useReputation: bool = True, + tagFilterPercentile: int = 95, + incorrectFilterThreshold: float = 2.5, + firmRejectThreshold: Optional[float] = None, ): """Configure MatrixFactorizationScorer object. Args: + includedGroups: if set, filter ratings and results based on includedGroups + includedTopics: if set, filter ratings based on includedTopics seed: if not None, seed value to ensure deterministic execution pseudoraters: if True, compute optional pseudorater confidence intervals minNumRatingsPerRater: Minimum number of ratings which a rater must produce to be @@ -198,7 +228,14 @@ def __init__( maxFirstMFTrainError: maximum error allowed for the first MF training process maxFinalMFTrainError: maximum error allowed for the final MF training process """ - super().__init__(seed, threads) + super().__init__( + includedTopics=includedTopics, + includedGroups=includedGroups, + includeUnassigned=includeUnassigned, + captureThreshold=captureThreshold, + seed=seed, + threads=threads, + ) self._pseudoraters = pseudoraters self._minNumRatingsPerRater = minNumRatingsPerRater self._minNumRatersPerNote = minNumRatersPerNote @@ -223,6 +260,9 @@ def __init__( self.minimumHarassmentScoreToPenalize = minimumHarassmentScoreToPenalize self.tagConsensusHarassmentHelpfulRatingPenalty = tagConsensusHarassmentHelpfulRatingPenalty self._useReputation = useReputation + self._tagFilterPercentile = tagFilterPercentile + self._incorrectFilterThreshold = incorrectFilterThreshold + self._firmRejectThreshold = firmRejectThreshold mfArgs = dict( [ pair @@ -268,7 +308,17 @@ def get_crh_threshold(self) -> float: def get_scored_notes_cols(self) -> List[str]: """Returns a list of columns which should be present in the scoredNotes output.""" - return self.get_internal_scored_notes_cols() + return [ + c.noteIdKey, + c.internalNoteInterceptKey, + c.internalNoteFactor1Key, + c.internalRatingStatusKey, + c.internalActiveRulesKey, + c.activeFilterTagsKey, + c.noteInterceptMaxKey, + c.noteInterceptMinKey, + c.numFinalRoundRatingsKey, + ] def get_internal_scored_notes_cols(self) -> List[str]: """Returns a list of internal columns which should be present 
in the scoredNotes output.""" @@ -281,11 +331,22 @@ def get_internal_scored_notes_cols(self) -> List[str]: c.activeFilterTagsKey, c.noteInterceptMaxKey, c.noteInterceptMinKey, + c.numFinalRoundRatingsKey, + c.lowDiligenceNoteInterceptKey, + c.lowDiligenceNoteFactor1Key, ] def get_helpfulness_scores_cols(self) -> List[str]: """Returns a list of columns which should be present in the helpfulnessScores output.""" - return self.get_internal_helpfulness_scores_cols() + return [ + c.raterParticipantIdKey, + c.internalRaterInterceptKey, + c.internalRaterFactor1Key, + c.crhCrnhRatioDifferenceKey, + c.meanNoteScoreKey, + c.raterAgreeRatioKey, + c.aboveHelpfulnessThresholdKey, + ] def get_internal_helpfulness_scores_cols(self) -> List[str]: """Returns a list of internal columns which should be present in the helpfulnessScores output.""" @@ -297,6 +358,9 @@ def get_internal_helpfulness_scores_cols(self) -> List[str]: c.meanNoteScoreKey, c.raterAgreeRatioKey, c.aboveHelpfulnessThresholdKey, + c.lowDiligenceRaterInterceptKey, + c.lowDiligenceRaterFactor1Key, + c.lowDiligenceRaterReputationKey, ] def get_auxiliary_note_info_cols(self) -> List[str]: @@ -331,13 +395,20 @@ def _get_dropped_user_cols(self) -> List[str]: """Returns a list of columns which should be excluded from helpfulnessScores output.""" return [] - def _prepare_data_for_scoring(self, ratings: pd.DataFrame) -> pd.DataFrame: + def _prepare_data_for_scoring(self, ratings: pd.DataFrame, final: bool = False) -> pd.DataFrame: """Prepare data for scoring. This includes filtering out notes and raters which do not meet minimum rating counts, and may be overridden by subclasses to add additional filtering. """ - return process_data.filter_ratings( - ratings, self._minNumRatingsPerRater, self._minNumRatersPerNote - ) + if final: + return process_data.filter_ratings( + ratings, minNumRatingsPerRater=0, minNumRatersPerNote=self._minNumRatersPerNote + ) + else: + return process_data.filter_ratings( + ratings, + minNumRatingsPerRater=self._minNumRatingsPerRater, + minNumRatersPerNote=self._minNumRatersPerNote, + ) def _run_regular_matrix_factorization(self, ratingsForTraining: pd.DataFrame): """Train a matrix factorization model on the ratingsForTraining data. @@ -397,9 +468,26 @@ def _run_stable_matrix_factorization( ) return modelResult + def compute_tag_thresholds_for_percentile( + self, scoredNotes, raterParams, ratings + ) -> Dict[str, float]: + with c.time_block(f"{self.get_name()}: Compute tag thresholds for percentiles"): + # Compute tag aggregates (in the same way as is done in final scoring in note_ratings.compute_scored_notes) + tagAggregates = tag_filter.get_note_tag_aggregates(ratings, scoredNotes, raterParams) + assert len(tagAggregates) == len( + scoredNotes + ), "There should be one aggregate per scored note." 
+ scoredNotes = tagAggregates.merge(scoredNotes, on=c.noteIdKey, how="outer") + + # Compute percentile thresholds for each tag + crhNotes = scoredNotes[scoredNotes[c.currentlyRatedHelpfulBoolKey]][[c.noteIdKey]] + crhStats = scoredNotes.merge(crhNotes, on=c.noteIdKey, how="inner") + thresholds = tag_filter.get_tag_thresholds(crhStats, self._tagFilterPercentile) + return thresholds + def _prescore_notes_and_users( self, ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, userEnrollmentRaw: pd.DataFrame - ) -> Tuple[pd.DataFrame, pd.DataFrame]: + ) -> Tuple[pd.DataFrame, pd.DataFrame, c.PrescoringMetaScorerOutput]: """ Fit initial matrix factorization model(s) on the ratings data in order to generate initial note and rater parameters (and rater helpfulness scores) that are passed to @@ -420,13 +508,28 @@ def _prescore_notes_and_users( helpfulnessScores (pd.DataFrame) """ if self._seed is not None: - print(f"seeding with {self._seed}") + logger.info(f"seeding with {self._seed}") torch.manual_seed(self._seed) # Removes ratings where either (1) the note did not receive enough ratings, or # (2) the rater did not rate enough notes. with self.time_block("Prepare ratings"): - ratingsForTraining = self._prepare_data_for_scoring(ratings) + ratingsForTraining = self._prepare_data_for_scoring( + ratings[ + [ + c.noteIdKey, + c.raterParticipantIdKey, + c.helpfulNumKey, + c.createdAtMillisKey, + c.helpfulnessLevelKey, + c.notHelpfulIncorrectTagKey, + c.notHelpfulIrrelevantSourcesTagKey, + c.notHelpfulSourcesMissingOrUnreliableTagKey, + c.notHelpfulSpamHarassmentOrAbuseTagKey, + c.notHelpfulOtherTagKey, + ] + ] + ) if self._saveIntermediateState: self.ratingsForTraining = ratingsForTraining @@ -436,19 +539,24 @@ def _prescore_notes_and_users( noteParamsUnfiltered, raterParamsUnfiltered, globalBias, - ) = self._run_stable_matrix_factorization(ratingsForTraining, userEnrollmentRaw) + ) = self._run_stable_matrix_factorization( + ratingsForTraining[[c.noteIdKey, c.raterParticipantIdKey, c.helpfulNumKey]], + userEnrollmentRaw[[c.participantIdKey, c.modelingGroupKey]], + ) if self._saveIntermediateState: self.noteParamsUnfiltered = noteParamsUnfiltered self.raterParamsUnfiltered = raterParamsUnfiltered self.globalBias = globalBias - self.assert_train_error_is_below_threshold(ratingsForTraining, self._maxFirstMFTrainError) + self.assert_train_error_is_below_threshold( + ratingsForTraining[[c.noteIdKey]], self._maxFirstMFTrainError + ) # If reputation is disabled, generate final intercepts, factors and note status # based on the first round scoring results. Disabling reputation can be desirable # in situations where the overall volume of ratings is lower (e.g. topic models). 
if not self._useReputation: assert "Topic" in self.get_name(), f"Unexpected scorer: {self.get_name()}" - print(f"Skipping rep-filtering in prescoring for {self.get_name()}") + logger.info(f"Skipping rep-filtering in prescoring for {self.get_name()}") helpfulnessScores = raterParamsUnfiltered[[c.raterParticipantIdKey]] helpfulnessScores[ [ @@ -458,16 +566,47 @@ def _prescore_notes_and_users( c.aboveHelpfulnessThresholdKey, ] ] = np.nan + noteParams = noteParamsUnfiltered + raterParams = raterParamsUnfiltered + # TODO: delete after we run prescoring diligence properly + # diligenceGlobalIntercept = None + finalRoundRatings = ratingsForTraining else: assert "Topic" not in self.get_name(), f"Unexpected scorer: {self.get_name()}" - print(f"Performing rep-filtering for {self.get_name()}") + logger.info(f"Performing rep-filtering for {self.get_name()}") # Get a dataframe of scored notes based on the algorithm results above with self.time_block("Compute scored notes"): scoredNotes = note_ratings.compute_scored_notes( - ratings, - noteParamsUnfiltered, - raterParamsUnfiltered, - noteStatusHistory, + ratings[ + [c.noteIdKey, c.raterParticipantIdKey, c.helpfulnessLevelKey, c.createdAtMillisKey] + + c.notHelpfulTagsTSVOrder + + c.helpfulTagsTSVOrder + ], + keep_columns( + noteParamsUnfiltered, + [ + c.noteIdKey, + c.internalNoteInterceptKey, + c.internalNoteFactor1Key, + ] + + c.noteParameterUncertaintyTSVColumns, + ), + raterParamsUnfiltered[ + [ + c.raterParticipantIdKey, + c.internalRaterFactor1Key, + ] + ], + noteStatusHistory[ + [ + c.noteIdKey, + c.createdAtMillisKey, + c.noteAuthorParticipantIdKey, + c.classificationKey, + c.currentLabelKey, + c.lockedStatusKey, + ] + ], minRatingsNeeded=self._minRatingsNeeded, crhThreshold=self._crhThreshold, crnhThresholdIntercept=self._crnhThresholdIntercept, @@ -476,7 +615,10 @@ def _prescore_notes_and_users( crnhThresholdUCBIntercept=self._crnhThresholdUCBIntercept, crhSuperThreshold=self._crhSuperThreshold, inertiaDelta=self._inertiaDelta, - lowDiligenceThreshold=self._lowDiligenceThreshold, + incorrectFilterThreshold=self._incorrectFilterThreshold, + tagFilterThresholds=None, + finalRound=False, + firmRejectThreshold=self._firmRejectThreshold, ) if self._saveIntermediateState: self.prescoringScoredNotes = scoredNotes @@ -484,8 +626,10 @@ def _prescore_notes_and_users( # Determine "valid" ratings with self.time_block("Compute valid ratings"): validRatings = note_ratings.get_valid_ratings( - ratings, - noteStatusHistory, + ratings[[c.noteIdKey, c.raterParticipantIdKey, c.helpfulNumKey, c.createdAtMillisKey]], + noteStatusHistory[ + [c.noteIdKey, c.createdAtMillisKey, c.timestampMillisOfNoteMostRecentNonNMRLabelKey] + ], scoredNotes[ [ c.noteIdKey, @@ -511,11 +655,13 @@ def _prescore_notes_and_users( c.internalNoteInterceptKey, ] ], - validRatings, + validRatings[ + [c.raterParticipantIdKey, c.ratingAgreesWithNoteStatusKey, c.ratingCountKey] + ], self._minMeanNoteScore, self._minCRHVsCRNHRatio, self._minRaterAgreeRatio, - ratingsForTraining, + ratingsForTraining[[c.noteIdKey, c.raterParticipantIdKey, c.helpfulNumKey]], ) ) if self._saveIntermediateState: @@ -526,7 +672,17 @@ def _prescore_notes_and_users( with self.time_block("Filtering by helpfulness score"): ratingsHelpfulnessScoreFilteredPreHarassmentFilter = ( helpfulness_scores.filter_ratings_by_helpfulness_scores( - ratingsForTraining, helpfulnessScoresPreHarassmentFilter + ratingsForTraining[ + [ + c.noteIdKey, + c.raterParticipantIdKey, + c.notHelpfulSpamHarassmentOrAbuseTagKey, + 
c.createdAtMillisKey, + c.helpfulnessLevelKey, + c.notHelpfulOtherTagKey, + ] + ], + helpfulnessScoresPreHarassmentFilter, ) ) @@ -539,10 +695,15 @@ def _prescore_notes_and_users( harassmentAbuseNoteParams, _, _ = tag_consensus.train_tag_model( ratingsHelpfulnessScoreFilteredPreHarassmentFilter, c.notHelpfulSpamHarassmentOrAbuseTagKey, - noteParamsUnfiltered, - raterParamsUnfiltered, + noteParamsUnfiltered[[c.noteIdKey, c.internalNoteInterceptKey, c.internalNoteFactor1Key]], + raterParamsUnfiltered[ + [c.raterParticipantIdKey, c.internalRaterInterceptKey, c.internalRaterFactor1Key] + ], name="harassment", ) + if not self._saveIntermediateState: + del ratingsHelpfulnessScoreFilteredPreHarassmentFilter + gc.collect() # Assigns contributor (author & rater) helpfulness bit based on (1) performance # authoring and reviewing previous and current notes, and (2) including an extra @@ -557,32 +718,161 @@ def _prescore_notes_and_users( c.internalNoteInterceptKey, ] ], - validRatings, + validRatings[ + [c.raterParticipantIdKey, c.ratingAgreesWithNoteStatusKey, c.ratingCountKey] + ], self._minMeanNoteScore, self._minCRHVsCRNHRatio, self._minRaterAgreeRatio, - ratings=ratingsForTraining, + ratings=ratingsForTraining[[c.noteIdKey, c.raterParticipantIdKey, c.helpfulNumKey]], tagConsensusHarassmentAbuseNotes=harassmentAbuseNoteParams, tagConsensusHarassmentHelpfulRatingPenalty=self.tagConsensusHarassmentHelpfulRatingPenalty, multiplyPenaltyByHarassmentScore=self.multiplyPenaltyByHarassmentScore, minimumHarassmentScoreToPenalize=self.minimumHarassmentScoreToPenalize, ) + if not self._saveIntermediateState: + del validRatings + gc.collect() if self._saveIntermediateState: self.helpfulnessScores = helpfulnessScores ## One extra final round! # Filter ratings based on prev helpfulness scores - finalRoundRatings = helpfulness_scores.filter_ratings_by_helpfulness_scores( - ratingsForTraining, helpfulnessScores + with c.time_block("Final round MF"): + finalRoundRatings = helpfulness_scores.filter_ratings_by_helpfulness_scores( + ratingsForTraining[ + [ + c.noteIdKey, + c.raterParticipantIdKey, + c.helpfulNumKey, + c.notHelpfulIncorrectTagKey, + c.notHelpfulSourcesMissingOrUnreliableTagKey, + c.notHelpfulIrrelevantSourcesTagKey, + ] + ], + helpfulnessScores[[c.raterParticipantIdKey, c.aboveHelpfulnessThresholdKey]], + ) + noteParams, raterParams, globalBias = self._mfRanker.run_mf( + ratings=finalRoundRatings[[c.noteIdKey, c.raterParticipantIdKey, c.helpfulNumKey]], + noteInit=noteParamsUnfiltered[ + [c.noteIdKey, c.internalNoteInterceptKey, c.internalNoteFactor1Key] + ], + userInit=raterParamsUnfiltered[ + [c.raterParticipantIdKey, c.internalRaterInterceptKey, c.internalRaterFactor1Key] + ], + ) + + # Run Diligence MF Prescoring, based on the final MF + with self.time_block("Low Diligence MF"): + # Initialize diligence rater factors with final round helpful MF rater factor + raterParamsDiligenceInit = raterParams[ + [c.raterParticipantIdKey, c.internalRaterFactor1Key] + ].rename({c.internalRaterFactor1Key: c.lowDiligenceRaterFactor1Key}, axis=1) + logger.info( + f"In {self.get_name()} prescoring, about to call diligence with {len(finalRoundRatings)} final round ratings." 
) - # Run MF - noteParamsUnfiltered, raterParamsUnfiltered, globalBias = self._mfRanker.run_mf( - ratings=finalRoundRatings, - noteInit=noteParamsUnfiltered, - userInit=raterParamsUnfiltered, + ( + diligenceNoteParams, + diligenceRaterParams, + diligenceGlobalIntercept, + ) = fit_low_diligence_model_prescoring( + finalRoundRatings[ + [ + c.noteIdKey, + c.raterParticipantIdKey, + c.notHelpfulIncorrectTagKey, + c.notHelpfulSourcesMissingOrUnreliableTagKey, + c.notHelpfulIrrelevantSourcesTagKey, + ] + ], + raterInitStateDiligence=raterParamsDiligenceInit, ) + noteParams = noteParams.merge(diligenceNoteParams, on=c.noteIdKey) + raterParams = raterParams.merge(diligenceRaterParams, on=c.raterParticipantIdKey) + + # Compute scored notes -- currently not returned; only used for downstream computation. + scoredNotes = note_ratings.compute_scored_notes( + ratings[ + [c.noteIdKey, c.raterParticipantIdKey, c.helpfulnessLevelKey, c.createdAtMillisKey] + + c.notHelpfulTagsTSVOrder + + c.helpfulTagsTSVOrder + ], + keep_columns( + noteParamsUnfiltered, + [ + c.noteIdKey, + c.internalNoteInterceptKey, + c.internalNoteFactor1Key, + ] + + c.noteParameterUncertaintyTSVColumns, + ), + raterParamsUnfiltered[ + [ + c.raterParticipantIdKey, + c.internalRaterFactor1Key, + ] + ], + noteStatusHistory[ + [ + c.noteIdKey, + c.createdAtMillisKey, + c.noteAuthorParticipantIdKey, + c.classificationKey, + c.currentLabelKey, + c.lockedStatusKey, + ] + ], + minRatingsNeeded=self._minRatingsNeeded, + crhThreshold=self._crhThreshold, + crnhThresholdIntercept=self._crnhThresholdIntercept, + crnhThresholdNoteFactorMultiplier=self._crnhThresholdNoteFactorMultiplier, + crnhThresholdNMIntercept=self._crnhThresholdNMIntercept, + crnhThresholdUCBIntercept=self._crnhThresholdUCBIntercept, + crhSuperThreshold=self._crhSuperThreshold, + inertiaDelta=self._inertiaDelta, + tagFilterThresholds=None, + incorrectFilterThreshold=self._incorrectFilterThreshold, + finalRound=False, + factorThreshold=self._factorThreshold, + firmRejectThreshold=self._firmRejectThreshold, + ) + + # Compute meta output + metaOutput = c.PrescoringMetaScorerOutput( + globalIntercept=globalBias, + lowDiligenceGlobalIntercept=diligenceGlobalIntercept, + tagFilteringThresholds=self.compute_tag_thresholds_for_percentile( + scoredNotes=noteParams[[c.noteIdKey, c.internalNoteFactor1Key]].merge( + scoredNotes[[c.noteIdKey, c.currentlyRatedHelpfulBoolKey]], + on=c.noteIdKey, + suffixes=("", "_dup"), + ), + raterParams=raterParams[[c.raterParticipantIdKey, c.internalRaterFactor1Key]], + ratings=ratings[ + [ + c.noteIdKey, + c.raterParticipantIdKey, + ] + + c.notHelpfulTagsTSVOrder + ], + ), + finalRoundNumRatings=len(finalRoundRatings), + finalRoundNumNotes=finalRoundRatings[c.noteIdKey].nunique(), + finalRoundNumUsers=finalRoundRatings[c.raterParticipantIdKey].nunique(), + ) + + # Compute user incorrect tag aggregates + userIncorrectTagUsageDf = get_user_incorrect_ratio( + ratings[ + [ + c.noteIdKey, + c.raterParticipantIdKey, + ] + + c.notHelpfulTagsTSVOrder + ] + ) - raterModelOutput = raterParamsUnfiltered.merge( + raterModelOutput = raterParams.merge( helpfulnessScores[ [ c.raterParticipantIdKey, @@ -594,10 +884,21 @@ def _prescore_notes_and_users( ], on=c.raterParticipantIdKey, how="outer", + ).merge( + userIncorrectTagUsageDf, + on=c.raterParticipantIdKey, + how="left", + unsafeAllowed={c.totalRatingsMadeByRaterKey, c.incorrectTagRatingsMadeByRaterKey}, ) - noteModelOutput = noteParamsUnfiltered - return noteModelOutput, raterModelOutput + noteModelOutput = 
noteParams + # Returning should remove references to these, but manually trigger GC just to reclaim + # resources as soon as possible. + del ratings + del ratingsForTraining + del finalRoundRatings + gc.collect() + return noteModelOutput, raterModelOutput, metaOutput def _score_notes_and_users( self, @@ -605,6 +906,8 @@ def _score_notes_and_users( noteStatusHistory: pd.DataFrame, prescoringNoteModelOutput: pd.DataFrame, prescoringRaterModelOutput: pd.DataFrame, + prescoringMetaScorerOutput: c.PrescoringMetaScorerOutput, + flipFactorsForIdentification: bool = False, ) -> Tuple[pd.DataFrame, pd.DataFrame]: """Run the "final" matrix factorization scoring algorithm. Accepts prescoring's output as its input, as well as the new ratings and note status history. @@ -624,24 +927,34 @@ def _score_notes_and_users( userScores pd.DataFrame: one row per user containing a column for each helpfulness score. """ if self._seed is not None: - print(f"seeding with {self._seed}") + logger.info(f"seeding with {self._seed}") torch.manual_seed(self._seed) - # Removes ratings where either (1) the note did not receive enough ratings, or - # (2) the rater did not rate enough notes. + # Removes ratings where either the note did not receive enough ratings with self.time_block("Prepare ratings"): - ratingsForTraining = self._prepare_data_for_scoring(ratings) + ratingsForTraining = self._prepare_data_for_scoring(ratings, final=True) if self._saveIntermediateState: self.ratingsForTraining = ratingsForTraining + # Filter raters with no rater parameters in this scorer + ratersWithParams = prescoringRaterModelOutput.loc[ + ( + (~pd.isna(prescoringRaterModelOutput[c.internalRaterInterceptKey])) + & (~pd.isna(prescoringRaterModelOutput[c.internalRaterInterceptKey])) + ), + [c.raterParticipantIdKey], + ] + ratingsForTraining = ratingsForTraining.merge( + ratersWithParams, how="inner", on=c.raterParticipantIdKey + ) + # Filters ratings matrix to include only rows (ratings) where the rater was # considered helpful. if not self._useReputation: assert ( "Topic" in self.get_name() ), f"Unexpected scorer has reputation filtering disabled: {self.get_name()}" - print(f"Skipping rep-filtering in 2nd phase for {self.get_name()}") - ## Still run entire scorer again here for topic models! Just run this final round from scratch. + logger.info(f"Skipping rep-filtering in 2nd phase for {self.get_name()}") finalRoundRatings = ratingsForTraining else: finalRoundRatings = helpfulness_scores.filter_ratings_by_helpfulness_scores( @@ -650,14 +963,28 @@ def _score_notes_and_users( if self._saveIntermediateState: self.finalRoundRatings = finalRoundRatings + assert ( + prescoringMetaScorerOutput.finalRoundNumNotes is not None + ), "Missing final round num notes" + assert ( + prescoringMetaScorerOutput.finalRoundNumRatings is not None + ), "Missing final round num ratings" + assert ( + prescoringMetaScorerOutput.finalRoundNumUsers is not None + ), "Missing final round num users" + # Re-runs matrix factorization using only ratings given by helpful raters. 
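For readers following the final-scoring flow: before the matrix factorization re-run in the next hunk, `helpfulness_scores.filter_ratings_by_helpfulness_scores` keeps only ratings from raters whose prescoring `aboveHelpfulnessThreshold` bit is set. A minimal pandas sketch of that filtering idea, using toy data and plain string column names as stand-ins for the keys in `constants.py` (this approximates the helper, it is not the production implementation):

```python
import pandas as pd

# Toy stand-ins for c.noteIdKey, c.raterParticipantIdKey, c.helpfulNumKey,
# and c.aboveHelpfulnessThresholdKey.
ratings = pd.DataFrame({
    "noteId": [1, 1, 2, 2],
    "raterParticipantId": ["a", "b", "a", "c"],
    "helpfulNum": [1.0, 0.0, 1.0, 0.5],
})
helpfulnessScores = pd.DataFrame({
    "raterParticipantId": ["a", "b", "c"],
    "aboveHelpfulnessThreshold": [True, False, True],
})

# Keep only ratings from raters who cleared the helpfulness threshold in prescoring.
helpfulRaters = helpfulnessScores.loc[
    helpfulnessScores["aboveHelpfulnessThreshold"], ["raterParticipantId"]
]
finalRoundRatings = ratings.merge(helpfulRaters, on="raterParticipantId", how="inner")
print(finalRoundRatings)  # rater "b" is dropped, leaving 3 of the 4 ratings
```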
with self.time_block("Final helpfulness-filtered MF"): noteParams, raterParams, globalBias = self._mfRanker.run_mf( ratings=finalRoundRatings, noteInit=prescoringNoteModelOutput, userInit=prescoringRaterModelOutput, - globalInterceptInit=0.17, + globalInterceptInit=prescoringMetaScorerOutput.globalIntercept, freezeRaterParameters=True, + freezeGlobalParameters=True, + ratingPerNoteLossRatio=prescoringMetaScorerOutput.finalRoundNumRatings + / prescoringMetaScorerOutput.finalRoundNumNotes, + flipFactorsForIdentification=flipFactorsForIdentification, ) if self._saveIntermediateState: @@ -665,7 +992,7 @@ def _score_notes_and_users( self.raterParams = raterParams self.globalBias = globalBias self.finalRoundRatings = finalRoundRatings - self.assert_train_error_is_below_threshold(finalRoundRatings, self._maxFinalMFTrainError) + # self.assert_train_error_is_below_threshold(finalRoundRatings, self._maxFinalMFTrainError) # Add pseudo-raters with the most extreme parameters and re-score notes, to estimate # upper and lower confidence bounds on note parameters. @@ -683,21 +1010,46 @@ def _score_notes_and_users( # Add low diligence intercepts. with self.time_block("Low Diligence Reputation Model"): - diligenceParams = get_low_diligence_intercepts(finalRoundRatings, raterInitState=raterParams) - noteParams = noteParams.merge(diligenceParams, on=c.noteIdKey) + logger.info( + f"In {self.get_name()} final scoring, about to call diligence with {len(finalRoundRatings)} final round ratings." + ) + assert ( + prescoringMetaScorerOutput.lowDiligenceGlobalIntercept is not None + ), "Missing low diligence global intercept" + diligenceNoteParams, diligenceRaterParams = fit_low_diligence_model_final( + finalRoundRatings, + noteInitStateDiligence=prescoringNoteModelOutput, + raterInitStateDiligence=prescoringRaterModelOutput, + globalInterceptDiligence=prescoringMetaScorerOutput.lowDiligenceGlobalIntercept, + ratingsPerNoteLossRatio=prescoringMetaScorerOutput.finalRoundNumRatings + / prescoringMetaScorerOutput.finalRoundNumNotes, + ratingsPerUserLossRatio=prescoringMetaScorerOutput.finalRoundNumRatings + / prescoringMetaScorerOutput.finalRoundNumUsers, + ) + logger.info(f"diligenceNP cols: {diligenceNoteParams.columns}") + noteParams = noteParams.merge(diligenceNoteParams, on=c.noteIdKey) + logger.info(f"np cols: {noteParams.columns}") if self._saveIntermediateState: self.noteParams = noteParams self.raterParams = raterParams self.globalBias = globalBias + raterParamsWithRatingCounts = raterParams.merge( + prescoringRaterModelOutput[ + [c.raterParticipantIdKey, c.incorrectTagRatingsMadeByRaterKey, c.totalRatingsMadeByRaterKey] + ], + on=c.raterParticipantIdKey, + ) + # Assigns updated CRH / CRNH bits to notes based on volume of prior ratings # and ML output. 
with self.time_block("Final compute scored notes"): + logger.info(f"About to call compute_scored_notes with {self.get_name()}") scoredNotes = note_ratings.compute_scored_notes( ratings, noteParams, - raterParams, + raterParamsWithRatingCounts, noteStatusHistory, minRatingsNeeded=self._minRatingsNeeded, crhThreshold=self._crhThreshold, @@ -707,10 +1059,14 @@ def _score_notes_and_users( crnhThresholdUCBIntercept=self._crnhThresholdUCBIntercept, crhSuperThreshold=self._crhSuperThreshold, inertiaDelta=self._inertiaDelta, + tagFilterThresholds=prescoringMetaScorerOutput.tagFilteringThresholds, + incorrectFilterThreshold=self._incorrectFilterThreshold, lowDiligenceThreshold=self._lowDiligenceThreshold, finalRound=True, factorThreshold=self._factorThreshold, + firmRejectThreshold=self._firmRejectThreshold, ) + logger.info(f"sn cols: {scoredNotes.columns}") # Takes raterParams from the MF run, but use the pre-computed # helpfulness scores from prescoringRaterModelOutput. @@ -733,3 +1089,115 @@ def _score_notes_and_users( self.helpfulnessScores = helpfulnessScores return scoredNotes, helpfulnessScores + + def score_final(self, scoringArgs: c.FinalScoringArgs) -> c.ModelResult: + """ + Process ratings to assign status to notes and optionally compute rater properties. + + Accepts prescoringNoteModelOutput and prescoringRaterModelOutput as args (fields on scoringArgs) + which are the outputs of the prescore() function. These are used to initialize the final scoring. + It filters the prescoring output to only include the rows relevant to this scorer, based on the + c.scorerNameKey field of those dataframes. + """ + torch.set_num_threads(self._threads) + logger.info( + f"score_final: Torch intra-op parallelism for {self.get_name()} set to: {torch.get_num_threads()}" + ) + + # Filter unfiltered params to just params for this scorer (with copy). + # Avoid editing the dataframe in FinalScoringArgs, which is shared across scorers. + prescoringNoteModelOutput = scoringArgs.prescoringNoteModelOutput[ + scoringArgs.prescoringNoteModelOutput[c.scorerNameKey] == self.get_name() + ].drop(columns=c.scorerNameKey, inplace=False) + + if scoringArgs.prescoringRaterModelOutput is None: + return self._return_empty_final_scores() + prescoringRaterModelOutput = scoringArgs.prescoringRaterModelOutput[ + scoringArgs.prescoringRaterModelOutput[c.scorerNameKey] == self.get_name() + ].drop(columns=c.scorerNameKey, inplace=False) + + if self.get_name() not in scoringArgs.prescoringMetaOutput.metaScorerOutput: + logger.info( + f"Scorer {self.get_name()} not found in prescoringMetaOutput; returning empty scores from final scoring." + ) + return self._return_empty_final_scores() + prescoringMetaScorerOutput = scoringArgs.prescoringMetaOutput.metaScorerOutput[self.get_name()] + + # Filter raw input + with self.time_block("Filter input"): + ratings, noteStatusHistory = self._filter_input( + scoringArgs.noteTopics, + scoringArgs.ratings, + scoringArgs.noteStatusHistory, + scoringArgs.userEnrollment, + ) + # If there are no ratings left after filtering, then return empty dataframes. 
+ if len(ratings) == 0: + return self._return_empty_final_scores() + + noteScores, userScores = self._score_notes_and_users( + ratings=ratings, + noteStatusHistory=noteStatusHistory, + prescoringNoteModelOutput=prescoringNoteModelOutput, + prescoringRaterModelOutput=prescoringRaterModelOutput, + prescoringMetaScorerOutput=prescoringMetaScorerOutput, + flipFactorsForIdentification=False, + ) + + with self.time_block("Postprocess output"): + # Only some subclasses do any postprocessing. + # E.g. topic models add confidence bit, group models prune according to authorship filter. + noteScores, userScores = self._postprocess_output( + noteScores, + userScores, + scoringArgs.ratings, + scoringArgs.noteStatusHistory, + scoringArgs.userEnrollment, + ) + + ## TODO: refactor this logic to compute 2nd round ratings out so score_final doesn't need to be overridden and duplicated. + scoredNoteFinalRoundRatings = ( + ratings[[c.raterParticipantIdKey, c.noteIdKey]] + .merge(userScores[[c.raterParticipantIdKey]], on=c.raterParticipantIdKey) + .groupby(c.noteIdKey) + .agg("count") + .reset_index() + .rename(columns={c.raterParticipantIdKey: c.numFinalRoundRatingsKey}) + ) + + noteScores = noteScores.merge( + scoredNoteFinalRoundRatings, + on=c.noteIdKey, + how="left", + unsafeAllowed=[c.defaultIndexKey, c.numFinalRoundRatingsKey], + ) + + noteScores = noteScores.rename(columns=self._get_note_col_mapping()) + userScores = userScores.rename(columns=self._get_user_col_mapping()) + + # Process noteScores + noteScores = noteScores.drop(columns=self._get_dropped_note_cols()) + assert set(noteScores.columns) == set( + self.get_scored_notes_cols() + self.get_auxiliary_note_info_cols() + ), f"""all columns must be either dropped or explicitly defined in an output. + Extra columns that were in noteScores: {set(noteScores.columns) - set(self.get_scored_notes_cols() + self.get_auxiliary_note_info_cols())} + Missing expected columns that should've been in noteScores: {set(self.get_scored_notes_cols() + self.get_auxiliary_note_info_cols()) - set(noteScores.columns)}""" + + # Process userScores + userScores = userScores.drop(columns=self._get_dropped_user_cols()) + assert set(userScores.columns) == set(self.get_helpfulness_scores_cols()), f"""all columns must be either dropped or explicitly defined in an output. + Extra columns that were in userScores: {set(userScores.columns) - set(self.get_helpfulness_scores_cols())} + Missing expected columns that should've been in userScores: {set(self.get_helpfulness_scores_cols()) - set(userScores.columns)}""" + + # Return dataframes with specified columns in specified order + return c.ModelResult( + scoredNotes=noteScores[self.get_scored_notes_cols()], + helpfulnessScores=userScores[self.get_helpfulness_scores_cols()] + if self.get_helpfulness_scores_cols() + else None, + auxiliaryNoteInfo=noteScores[self.get_auxiliary_note_info_cols()] + if self.get_auxiliary_note_info_cols() + else None, + scorerName=self.get_name(), + metaScores=None, + ) diff --git a/sourcecode/scoring/mf_core_scorer.py b/sourcecode/scoring/mf_core_scorer.py index c4d5786f..757d5bf0 100644 --- a/sourcecode/scoring/mf_core_scorer.py +++ b/sourcecode/scoring/mf_core_scorer.py @@ -1,104 +1,8 @@ -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional from . 
import constants as c from .mf_base_scorer import MFBaseScorer -import numpy as np -import pandas as pd - - -_CORE_BOOL = "coreBool" -_TOTAL = "total" -_RATIO = "ratio" - - -def filter_core_input( - ratingsOrig: pd.DataFrame, - noteStatusHistoryOrig: pd.DataFrame, - userEnrollment: pd.DataFrame, - coreThreshold: float = 0.5, -) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Prune the contents of ratings and noteStatusHistory to scope model behavior. - - This function identifies the subset of note and ratings to include in core model scoring. - A note is included in the core model if >50% of the ratings on the note come from users - in the CORE modelingPopulation. A rating is included in the core model if the rating is - on a CORE note *and* the rating is from a user in the CORE modeling population. - - Note that the criteria above implies that a note without any ratings can't be included in - the CORE model, which is acceptable because notes without ratings will be assigned a default - status of NEEDS_MORE_RATINGS by both the EXPANSION model and meta_score. - - Args: - ratings (pd.DataFrame): preprocessed ratings - noteStatusHistory (pd.DataFrame): one row per note; history of when note had each status - userEnrollment (pd.DataFrame): one row per user specifying enrollment properties - - Returns: - Tuple[pd.DataFrame, pd.DataFrame]: - ratings: ratings filtered to only contain rows of interest - noteStatusHistory: noteStatusHistory filtered to only contain rows of interest - """ - print("Identifying core notes and ratings") - # Identify EXPANSION_PLUS users and notes - expansionPlusUsers = set( - userEnrollment[userEnrollment[c.modelingPopulationKey] == c.expansionPlus][ - c.participantIdKey - ].values - ) - print(f" EXPANSION_PLUS users: {len(expansionPlusUsers)}") - expansionPlusNotes = set( - noteStatusHistoryOrig[ - noteStatusHistoryOrig[c.noteAuthorParticipantIdKey].isin(expansionPlusUsers) - ][c.noteIdKey].values - ) - print(f" EXPANSION_PLUS notes: {len(expansionPlusNotes)}") - # Remove EXPANSION_PLUS users and notes - print(f" original note status history length: {len(noteStatusHistoryOrig)}") - noteStatusHistory = noteStatusHistoryOrig[ - ~noteStatusHistoryOrig[c.noteAuthorParticipantIdKey].isin(expansionPlusUsers) - ] - print(f" note status history length after EXPANSION_PLUS filter: {len(noteStatusHistory)}") - print(f" original ratings length: {len(ratingsOrig)}") - ratings = ratingsOrig[~ratingsOrig[c.raterParticipantIdKey].isin(expansionPlusUsers)] - ratings = ratings[~ratings[c.noteIdKey].isin(expansionPlusNotes)] - print(f" ratings length after EXPANSION_PLUS filter: {len(ratings)}") - # Separate CORE and EXPANSION notes. - userEnrollment[_CORE_BOOL] = userEnrollment[c.modelingPopulationKey] == c.core - userGroups = userEnrollment[[c.participantIdKey, _CORE_BOOL]].copy() - ratings = ratings.merge( - userGroups.rename(columns={c.participantIdKey: c.raterParticipantIdKey}), - on=c.raterParticipantIdKey, - how="left", - ) - print(f" Ratings from user without modelingPopulation: {pd.isna(ratings[_CORE_BOOL]).sum()}") - ratings = ratings.fillna({_CORE_BOOL: True}) - ratings[_CORE_BOOL] = ratings[_CORE_BOOL].astype(np.bool8) - counts = ratings[[c.noteIdKey, _CORE_BOOL]].copy() - counts[_TOTAL] = 1 - counts = counts.groupby(c.noteIdKey).sum(numeric_only=True).reset_index() - counts[_RATIO] = counts[_CORE_BOOL] / counts[_TOTAL] - # Identify CORE notes. We define an EXPANSION note to be any note which (1) has ratings - # and (2) less than half of the ratings are from CORE users. 
Any other note is considered - # a CORE note. This construction means that we only count a note as EXPANSION when there - # is reason to believe that the EXPANSION model could assign the note status. In all other - # case we leave the note as CORE so that the note will be eligble for locking. In effect, - # this approach biases us towards locking note status at 2 weeks and only avoiding locking - # when a note is scored by the EXPANSION model. - print(f" Total notes: {len(noteStatusHistory)}") - print(f" Total notes with ratings: {len(counts)}") - expansionNotes = set(counts[counts[_RATIO] <= coreThreshold][c.noteIdKey]) - coreNotes = set(noteStatusHistory[c.noteIdKey]) - expansionNotes - print(f" Total core notes: {len(coreNotes)}") - print(f" Total expansion notes: {len(expansionNotes)}") - # Prune notes and ratings to ratings from CORE users on CORE notes. - ratings = ratings[ratings[_CORE_BOOL]] - ratings = ratings.drop(columns=_CORE_BOOL) - ratings = ratings[ratings[c.noteIdKey].isin(coreNotes)] - noteStatusHistory = noteStatusHistory[noteStatusHistory[c.noteIdKey].isin(coreNotes)] - print(f" Core ratings: {len(ratings)}") - return ratings, noteStatusHistory - class MFCoreScorer(MFBaseScorer): def __init__( @@ -108,6 +12,7 @@ def __init__( useStableInitialization: bool = True, saveIntermediateState: bool = False, threads: int = c.defaultNumThreads, + firmRejectThreshold: Optional[float] = 0.3, ) -> None: """Configure MFCoreScorer object. @@ -117,11 +22,15 @@ def __init__( threads: number of threads to use for intra-op parallelism in pytorch """ super().__init__( - seed, - pseudoraters, + includedGroups=c.coreGroups, + includeUnassigned=True, + captureThreshold=0.5, + seed=seed, + pseudoraters=pseudoraters, useStableInitialization=useStableInitialization, saveIntermediateState=saveIntermediateState, threads=threads, + firmRejectThreshold=firmRejectThreshold, ) def get_name(self): @@ -136,6 +45,8 @@ def _get_note_col_mapping(self) -> Dict[str, str]: c.internalActiveRulesKey: c.coreActiveRulesKey, c.noteInterceptMinKey: c.coreNoteInterceptMinKey, c.noteInterceptMaxKey: c.coreNoteInterceptMaxKey, + c.numFinalRoundRatingsKey: c.coreNumFinalRoundRatingsKey, + c.lowDiligenceNoteInterceptKey: c.lowDiligenceLegacyNoteInterceptKey, } def _get_user_col_mapping(self) -> Dict[str, str]: @@ -156,6 +67,7 @@ def get_scored_notes_cols(self) -> List[str]: c.activeFilterTagsKey, c.coreNoteInterceptMinKey, c.coreNoteInterceptMaxKey, + c.coreNumFinalRoundRatingsKey, ] def get_helpfulness_scores_cols(self) -> List[str]: @@ -169,13 +81,3 @@ def get_helpfulness_scores_cols(self) -> List[str]: c.raterAgreeRatioKey, c.aboveHelpfulnessThresholdKey, ] - - def _filter_input( - self, - noteTopics: pd.DataFrame, - ratingsOrig: pd.DataFrame, - noteStatusHistoryOrig: pd.DataFrame, - userEnrollment: pd.DataFrame, - ) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Prune the contents of ratings and noteStatusHistory to scope model behavior.""" - return filter_core_input(ratingsOrig, noteStatusHistoryOrig, userEnrollment) diff --git a/sourcecode/scoring/mf_expansion_plus_scorer.py b/sourcecode/scoring/mf_expansion_plus_scorer.py index 48e315c1..065f8177 100644 --- a/sourcecode/scoring/mf_expansion_plus_scorer.py +++ b/sourcecode/scoring/mf_expansion_plus_scorer.py @@ -1,10 +1,8 @@ -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional from . 
import constants as c from .mf_base_scorer import MFBaseScorer -import pandas as pd - class MFExpansionPlusScorer(MFBaseScorer): def __init__( @@ -21,7 +19,9 @@ def __init__( threads: number of threads to use for intra-op parallelism in pytorch """ super().__init__( - seed, + includedGroups=(c.coreGroups | c.expansionGroups | c.expansionPlusGroups), + includeUnassigned=True, + seed=seed, pseudoraters=False, useStableInitialization=useStableInitialization, saveIntermediateState=saveIntermediateState, @@ -38,6 +38,15 @@ def _get_note_col_mapping(self) -> Dict[str, str]: c.internalNoteFactor1Key: c.expansionPlusNoteFactor1Key, c.internalRatingStatusKey: c.expansionPlusRatingStatusKey, c.internalActiveRulesKey: c.expansionPlusInternalActiveRulesKey, + c.numFinalRoundRatingsKey: c.expansionPlusNumFinalRoundRatingsKey, + c.lowDiligenceNoteInterceptKey: c.lowDiligenceLegacyNoteInterceptKey, + } + + def _get_user_col_mapping(self) -> Dict[str, str]: + """Returns a dict mapping default user column names to custom names for a specific model.""" + return { + c.internalRaterInterceptKey: c.expansionPlusRaterInterceptKey, + c.internalRaterFactor1Key: c.expansionPlusRaterFactor1Key, } def get_scored_notes_cols(self) -> List[str]: @@ -48,11 +57,16 @@ def get_scored_notes_cols(self) -> List[str]: c.expansionPlusNoteFactor1Key, c.expansionPlusRatingStatusKey, c.expansionPlusInternalActiveRulesKey, + c.expansionPlusNumFinalRoundRatingsKey, ] def get_helpfulness_scores_cols(self) -> List[str]: """Returns a list of columns which should be present in the helpfulnessScores output.""" - return [] + return [ + c.raterParticipantIdKey, + c.expansionPlusRaterInterceptKey, + c.expansionPlusRaterFactor1Key, + ] def get_auxiliary_note_info_cols(self) -> List[str]: """Returns a list of columns which should be present in the auxiliaryNoteInfo output.""" @@ -76,45 +90,8 @@ def _get_dropped_note_cols(self) -> List[str]: def _get_dropped_user_cols(self) -> List[str]: """Returns a list of columns which should be excluded from helpfulnessScores output.""" return super()._get_dropped_user_cols() + [ - c.raterParticipantIdKey, - c.internalRaterInterceptKey, - c.internalRaterFactor1Key, c.crhCrnhRatioDifferenceKey, c.meanNoteScoreKey, c.raterAgreeRatioKey, c.aboveHelpfulnessThresholdKey, ] - - def _postprocess_output( - self, - noteScores: pd.DataFrame, - userScores: pd.DataFrame, - ratings: pd.DataFrame, - noteStatusHistory: pd.DataFrame, - userEnrollment: pd.DataFrame, - ) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Filter noteScores to only include notes authored by EXPANSION_PLUS users. - - Args: - noteScores: note outputs from scoring - userScores: user outputs from scoring - ratings (pd.DataFrame): preprocessed ratings - noteStatusHistory (pd.DataFrame): one row per note; history of when note had each status - userEnrollment (pd.DataFrame): one row per user specifying enrollment properties - - Returns: - Tuple[pd.DataFrame, pd.DataFrame]: - noteScores: filtered and updated note scoring output - userScores: unaltered - """ - # Identify EXPANSION_PLUS users. - expansionPlusAuthors = userEnrollment[ - userEnrollment[c.modelingPopulationKey] == c.expansionPlus - ][[c.participantIdKey]].rename(columns={c.participantIdKey: c.noteAuthorParticipantIdKey}) - # Identify note written by EXPANSION_PLUS users. - expnasionPlusNotes = noteStatusHistory.merge( - expansionPlusAuthors, on=c.noteAuthorParticipantIdKey - )[[c.noteIdKey]] - # Prune to EXPANSION_PLUS users and return. 
- noteScores = noteScores.merge(expnasionPlusNotes, on=c.noteIdKey) - return noteScores, userScores diff --git a/sourcecode/scoring/mf_expansion_scorer.py b/sourcecode/scoring/mf_expansion_scorer.py index 5b41037b..a292dc24 100644 --- a/sourcecode/scoring/mf_expansion_scorer.py +++ b/sourcecode/scoring/mf_expansion_scorer.py @@ -1,13 +1,10 @@ -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional from . import constants as c from .mf_base_scorer import MFBaseScorer -import numpy as np -import pandas as pd - -_EXPANSION_PLUS_BOOL = "expansionPlusBool" +_EXPANSION_BOOL = "expansionBool" class MFExpansionScorer(MFBaseScorer): @@ -17,6 +14,7 @@ def __init__( useStableInitialization: bool = True, saveIntermediateState: bool = False, threads: int = c.defaultNumThreads, + firmRejectThreshold: Optional[float] = 0.3, ) -> None: """Configure MFExpansionScorer object. @@ -25,11 +23,15 @@ def __init__( threads: number of threads to use for intra-op parallelism in pytorch """ super().__init__( - seed, + includedGroups=(c.coreGroups | c.expansionGroups), + includeUnassigned=True, + captureThreshold=0.5, + seed=seed, pseudoraters=False, useStableInitialization=useStableInitialization, saveIntermediateState=saveIntermediateState, threads=threads, + firmRejectThreshold=firmRejectThreshold, ) def get_name(self): @@ -44,6 +46,15 @@ def _get_note_col_mapping(self) -> Dict[str, str]: c.noteInterceptMinKey: c.expansionNoteInterceptMinKey, c.noteInterceptMaxKey: c.expansionNoteInterceptMaxKey, c.internalActiveRulesKey: c.expansionInternalActiveRulesKey, + c.numFinalRoundRatingsKey: c.expansionNumFinalRoundRatingsKey, + c.lowDiligenceNoteInterceptKey: c.lowDiligenceLegacyNoteInterceptKey, + } + + def _get_user_col_mapping(self) -> Dict[str, str]: + """Returns a dict mapping default user column names to custom names for a specific model.""" + return { + c.internalRaterInterceptKey: c.expansionRaterInterceptKey, + c.internalRaterFactor1Key: c.expansionRaterFactor1Key, } def get_scored_notes_cols(self) -> List[str]: @@ -56,11 +67,16 @@ def get_scored_notes_cols(self) -> List[str]: c.expansionNoteInterceptMinKey, c.expansionNoteInterceptMaxKey, c.expansionInternalActiveRulesKey, + c.expansionNumFinalRoundRatingsKey, ] def get_helpfulness_scores_cols(self) -> List[str]: """Returns a list of columns which should be present in the helpfulnessScores output.""" - return [] + return [ + c.raterParticipantIdKey, + c.expansionRaterInterceptKey, + c.expansionRaterFactor1Key, + ] def get_auxiliary_note_info_cols(self) -> List[str]: """Returns a list of columns which should be present in the auxiliaryNoteInfo output.""" @@ -82,79 +98,8 @@ def _get_dropped_note_cols(self) -> List[str]: def _get_dropped_user_cols(self) -> List[str]: """Returns a list of columns which should be excluded from helpfulnessScores output.""" return super()._get_dropped_user_cols() + [ - c.raterParticipantIdKey, - c.internalRaterInterceptKey, - c.internalRaterFactor1Key, c.crhCrnhRatioDifferenceKey, c.meanNoteScoreKey, c.raterAgreeRatioKey, c.aboveHelpfulnessThresholdKey, ] - - def _filter_input( - self, - noteTopics: pd.DataFrame, - ratingsOrig: pd.DataFrame, - noteStatusHistoryOrig: pd.DataFrame, - userEnrollment: pd.DataFrame, - ) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Prune the contents of ratings to scope model behavior. - - The MFExpansionScorer input is filtered to exclude notes and ratings from EXPANSION_PLUS - users. All other ratings are included. 
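With these `_filter_input` overrides removed, population scoping for the core and expansion scorers is expressed through the `includedGroups` and `captureThreshold` arguments passed to `MFBaseScorer` (0.5 in both hunks above). A rough sketch of what a 50% capture threshold means for a note, in the spirit of the deleted ratio logic; the base-scorer filtering itself is not shown in this diff and may differ in details, and the `raterInIncludedGroups` column is a hypothetical stand-in:

```python
import pandas as pd

# Each row is one rating on a note; the flag marks whether the rater belongs to
# this scorer's included modeling groups. (Toy data and column names.)
captureThreshold = 0.5
ratings = pd.DataFrame({
    "noteId": [1, 1, 1, 2, 2, 2],
    "raterInIncludedGroups": [True, True, False, False, False, True],
})

# A note is "captured" when at least captureThreshold of its ratings come from
# raters in the included groups.
inGroupRatio = ratings.groupby("noteId")["raterInIncludedGroups"].mean()
capturedNotes = set(inGroupRatio[inGroupRatio >= captureThreshold].index)
print(capturedNotes)  # {1}: note 1 has 2/3 in-group ratings, note 2 only 1/3
```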
- - Args: - ratings (pd.DataFrame): preprocessed ratings - noteStatusHistory (pd.DataFrame): one row per note; history of when note had each status - userEnrollment (pd.DataFrame): one row per user specifying enrollment properties - - Returns: - Tuple[pd.DataFrame, pd.DataFrame]: - ratingsOrig: ratings filtered to only contain rows of interest - noteStatusHistoryOrig: noteStatusHistory filtered to only contain rows of interest - """ - # Prepare userEnrollment for join with ratings. - userEnrollment[_EXPANSION_PLUS_BOOL] = ( - userEnrollment[c.modelingPopulationKey] == c.expansionPlus - ) - userEnrollment = userEnrollment[[c.participantIdKey, _EXPANSION_PLUS_BOOL]].copy() - print("Identifying expansion notes and ratings") - # Prune notes authored by EXPANSION_PLUS users. - print(f" Total notes: {len(noteStatusHistoryOrig)}") - noteStatusHistory = noteStatusHistoryOrig.merge( - userEnrollment.rename(columns={c.participantIdKey: c.noteAuthorParticipantIdKey}), - on=c.noteAuthorParticipantIdKey, - how="left", - ) - print( - f" Notes from user without modelingPopulation: {pd.isna(noteStatusHistory[_EXPANSION_PLUS_BOOL]).sum()}" - ) - noteStatusHistory = noteStatusHistory.fillna({_EXPANSION_PLUS_BOOL: False}) - noteStatusHistory[_EXPANSION_PLUS_BOOL] = noteStatusHistory[_EXPANSION_PLUS_BOOL].astype( - np.bool8 - ) - noteStatusHistory = noteStatusHistory[~noteStatusHistory[_EXPANSION_PLUS_BOOL]] - print(f" Total CORE and EXPANSION notes: {len(noteStatusHistory)}") - # Prune ratings from EXPANSION_PLUS users. - print(f" Total ratings: {len(ratingsOrig)}") - ratings = ratingsOrig.merge( - userEnrollment.rename(columns={c.participantIdKey: c.raterParticipantIdKey}), - on=c.raterParticipantIdKey, - how="left", - ) - print( - f" Ratings from user without modelingPopulation: {pd.isna(ratings[_EXPANSION_PLUS_BOOL]).sum()}" - ) - ratings = ratings.fillna({_EXPANSION_PLUS_BOOL: False}) - ratings[_EXPANSION_PLUS_BOOL] = ratings[_EXPANSION_PLUS_BOOL].astype(np.bool8) - ratings = ratings[~ratings[_EXPANSION_PLUS_BOOL]] - print(f" Ratings after EXPANSION_PLUS filter: {len(ratings)}") - # prune ratings on dropped notes - ratings = ratings.merge( - noteStatusHistory[[c.noteIdKey]].drop_duplicates(), on=c.noteIdKey, how="inner" - ) - print(f" Ratings after EXPANSION_PLUS notes filter: {len(ratings)}") - - return ratings.drop(columns=_EXPANSION_PLUS_BOOL), noteStatusHistory.drop( - columns=_EXPANSION_PLUS_BOOL - ) diff --git a/sourcecode/scoring/mf_group_scorer.py b/sourcecode/scoring/mf_group_scorer.py index ed9d7e27..3a01b8c9 100644 --- a/sourcecode/scoring/mf_group_scorer.py +++ b/sourcecode/scoring/mf_group_scorer.py @@ -1,4 +1,4 @@ -from typing import Dict, List, Optional, Tuple +from typing import Dict, List, Optional, Set, Tuple from . import constants as c from .mf_base_scorer import MFBaseScorer, coalesce_columns @@ -13,53 +13,53 @@ trialScoringGroup = 14 # Mapping of how many threads to assign to each group scorer -_groupScorerParalleism = { +groupScorerParalleism = { # Group model 13 is larger and benefits from more threads. # Others can default to 4. 13: 8 } -def coalesce_group_models( - scoredNotes: pd.DataFrame, helpfulnessScores: pd.DataFrame -) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Coalesce all group modeling columns across note and user scoring. +def coalesce_group_model_scored_notes(scoredNotes: pd.DataFrame) -> pd.DataFrame: + """Coalesce all group modeling columns across note scoring. 
Since each Scorer must have distinct output columns, we use coalescing to run multiple instances of MFGroupScorer objects and then condense the results into a single set of columns. This approach works because each note will be scored by at most one MFGroupScorer instance. - - Args: - scoredNotes: scoring output for notes. - helpfulnessScores: scoring output for users. - - Returns: - tuple containing coalesced scoring results for notes and users. """ for col in [ c.groupNoteInterceptKey, c.groupNoteFactor1Key, c.groupRatingStatusKey, - c.groupNoteInterceptMaxKey, - c.groupNoteInterceptMinKey, c.modelingGroupKey, c.groupInternalActiveRulesKey, + c.groupNumFinalRoundRatingsKey, ]: scoredNotes = coalesce_columns(scoredNotes, col) + return scoredNotes + + +def coalesce_group_model_helpfulness_scores(helpfulnessScores: pd.DataFrame) -> pd.DataFrame: + """Coalesce all group modeling columns across user scoring. + + Since each Scorer must have distinct output columns, we use coalescing to run + multiple instances of MFGroupScorer objects and then condense the results into + a single set of columns. This approach works because each note will be scored + by at most one MFGroupScorer instance. + """ for col in [c.groupRaterInterceptKey, c.groupRaterFactor1Key, c.modelingGroupKey]: helpfulnessScores = coalesce_columns(helpfulnessScores, col) - - return scoredNotes, helpfulnessScores + return helpfulnessScores class MFGroupScorer(MFBaseScorer): def __init__( self, - groupNumber: int, + includedGroups: Set[int], + groupId: int, seed: Optional[int] = None, - pseudoraters: Optional[bool] = False, groupThreshold: float = 0.8, saveIntermediateState: bool = False, userFactorLambda=None, @@ -71,18 +71,20 @@ def __init__( normalizedLossHyperparameters=None, maxFirstMFTrainError: float = 0.16, maxFinalMFTrainError: float = 0.09, - requireInternalAuthor: bool = True, minMeanNoteScore: float = 0.05, crhThreshold: float = 0.40, crnhThresholdIntercept: float = -0.05, crnhThresholdNoteFactorMultiplier: float = -0.8, crnhThresholdNMIntercept: float = -0.15, - crhSuperThreshold: float = 0.5, + crhSuperThreshold: Optional[float] = 0.5, lowDiligenceThreshold: float = 0.263, factorThreshold: float = 0.5, multiplyPenaltyByHarassmentScore: bool = True, minimumHarassmentScoreToPenalize: float = 2.0, tagConsensusHarassmentHelpfulRatingPenalty: int = 10, + tagFilterPercentile: int = 95, + incorrectFilterThreshold: float = 2.5, + threads: int = 4, ) -> None: """Configure MFGroupScorer object. @@ -101,11 +103,14 @@ def __init__( for the model to be active """ super().__init__( - seed, - pseudoraters, + includedGroups=includedGroups, + includeUnassigned=False, + captureThreshold=groupThreshold, + seed=seed, + pseudoraters=False, useStableInitialization=False, saveIntermediateState=saveIntermediateState, - threads=_groupScorerParalleism.get(groupNumber, 4), + threads=threads, userFactorLambda=userFactorLambda, noteFactorLambda=noteFactorLambda, userInterceptLambda=userInterceptLambda, @@ -126,24 +131,26 @@ def __init__( multiplyPenaltyByHarassmentScore=multiplyPenaltyByHarassmentScore, minimumHarassmentScoreToPenalize=minimumHarassmentScoreToPenalize, tagConsensusHarassmentHelpfulRatingPenalty=tagConsensusHarassmentHelpfulRatingPenalty, + tagFilterPercentile=tagFilterPercentile, + incorrectFilterThreshold=incorrectFilterThreshold, ) - assert groupNumber > 0, "groupNumber must be positive. 0 is reserved for unassigned." - assert groupNumber <= groupScorerCount, "groupNumber exceeds maximum expected groups." 
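The coalescing helpers above rely on each note (or rater) being handled by at most one group scorer, so the per-group suffixed columns never hold conflicting values. A minimal sketch of that idea; the real helper is `coalesce_columns` in `mf_base_scorer.py`, and the `bfill`-based collapse below is only an approximation of its behavior with illustrative column names:

```python
import pandas as pd

# Each group scorer writes its own suffixed column; at most one is non-null per row.
scoredNotes = pd.DataFrame({
    "noteId": [10, 11, 12],
    "groupNoteIntercept_1": [0.31, None, None],
    "groupNoteIntercept_2": [None, 0.12, None],
})

# Collapse the suffixed columns into a single unsuffixed column by taking the
# first non-null value in each row, then drop the per-group columns.
suffixed = [col for col in scoredNotes.columns if col.startswith("groupNoteIntercept_")]
scoredNotes["groupNoteIntercept"] = scoredNotes[suffixed].bfill(axis=1).iloc[:, 0]
scoredNotes = scoredNotes.drop(columns=suffixed)
print(scoredNotes)  # note 12 was not scored by any group model and stays NaN
```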
- self._groupNumber = groupNumber - self._groupThreshold = groupThreshold - self._groupNoteInterceptKey = f"{c.groupNoteInterceptKey}_{self._groupNumber}" - self._groupNoteFactor1Key = f"{c.groupNoteFactor1Key}_{self._groupNumber}" - self._groupRatingStatusKey = f"{c.groupRatingStatusKey}_{self._groupNumber}" - self._groupNoteInterceptMaxKey = f"{c.groupNoteInterceptMaxKey}_{self._groupNumber}" - self._groupNoteInterceptMinKey = f"{c.groupNoteInterceptMinKey}_{self._groupNumber}" - self._groupInternalActiveRulesKey = f"{c.groupInternalActiveRulesKey}_{self._groupNumber}" - self._groupRaterInterceptKey = f"{c.groupRaterInterceptKey}_{self._groupNumber}" - self._groupRaterFactor1Key = f"{c.groupRaterFactor1Key}_{self._groupNumber}" - self._modelingGroupKey = f"{c.modelingGroupKey}_{self._groupNumber}" - self._requireInternalAuthor = requireInternalAuthor + assert groupId > 0, "groupNumber must be positive. 0 is reserved for unassigned." + self._groupId = groupId + self._init_column_names() + + def _init_column_names(self): + """Initialize column names based on prefixes and groupId.""" + self._groupNoteInterceptKey = f"{c.groupNoteInterceptKey}_{self._groupId}" + self._groupNoteFactor1Key = f"{c.groupNoteFactor1Key}_{self._groupId}" + self._groupRatingStatusKey = f"{c.groupRatingStatusKey}_{self._groupId}" + self._groupInternalActiveRulesKey = f"{c.groupInternalActiveRulesKey}_{self._groupId}" + self._groupNumFinalRoundRatingsKey = f"{c.groupNumFinalRoundRatingsKey}_{self._groupId}" + self._groupRaterInterceptKey = f"{c.groupRaterInterceptKey}_{self._groupId}" + self._groupRaterFactor1Key = f"{c.groupRaterFactor1Key}_{self._groupId}" + self._modelingGroupKey = f"{c.modelingGroupKey}_{self._groupId}" def get_name(self): - return f"MFGroupScorer_{self._groupNumber}" + return f"MFGroupScorer_{self._groupId}" def _get_note_col_mapping(self) -> Dict[str, str]: """Returns a dict mapping default note column names to custom names for a specific model.""" @@ -151,9 +158,9 @@ def _get_note_col_mapping(self) -> Dict[str, str]: c.internalNoteInterceptKey: self._groupNoteInterceptKey, c.internalNoteFactor1Key: self._groupNoteFactor1Key, c.internalRatingStatusKey: self._groupRatingStatusKey, - c.noteInterceptMinKey: self._groupNoteInterceptMinKey, - c.noteInterceptMaxKey: self._groupNoteInterceptMaxKey, c.internalActiveRulesKey: self._groupInternalActiveRulesKey, + c.numFinalRoundRatingsKey: self._groupNumFinalRoundRatingsKey, + c.lowDiligenceNoteInterceptKey: c.lowDiligenceLegacyNoteInterceptKey, } def _get_user_col_mapping(self) -> Dict[str, str]: @@ -170,10 +177,9 @@ def get_scored_notes_cols(self) -> List[str]: self._groupNoteInterceptKey, self._groupNoteFactor1Key, self._groupRatingStatusKey, - self._groupNoteInterceptMaxKey, - self._groupNoteInterceptMinKey, self._groupInternalActiveRulesKey, self._modelingGroupKey, + self._groupNumFinalRoundRatingsKey, ] def get_helpfulness_scores_cols(self) -> List[str]: @@ -195,6 +201,8 @@ def _get_dropped_note_cols(self) -> List[str]: [ c.activeFilterTagsKey, c.ratingWeightKey, + c.noteInterceptMinKey, + c.noteInterceptMaxKey, ] + c.notHelpfulTagsAdjustedColumns + c.notHelpfulTagsAdjustedRatioColumns @@ -211,43 +219,6 @@ def _get_dropped_user_cols(self) -> List[str]: c.aboveHelpfulnessThresholdKey, ] - def _filter_input( - self, - noteTopics: pd.DataFrame, - ratings: pd.DataFrame, - noteStatusHistory: pd.DataFrame, - userEnrollment: pd.DataFrame, - ) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Prune the contents of ratings to only include ratings from users in the 
modeling group. - - This function identifies the subset of ratings to include in group model scoring. - To improve modeling within the group, we only include ratings from users in the modeling - group. However, we place no restriction on which notes to include in the model and instead - include ratings on any note. Including ratings on any note increases the amount of data - available during training about each user, in effect also increasing the number of users - and notes we are able to include in the model. - - Including notes by users outside of the modeling group means that the model will issue - scores for notes which do not meet group modeling criteria (i.e. >80% of ratings are - from users in the modeling group, and the author is also from the modeling group). We - enforce these criteria *after* scoring in _postprocess_output so that the maximum amount - of ratings are available during scoring. - - Args: - ratings (pd.DataFrame): preprocessed ratings - noteStatusHistory (pd.DataFrame): one row per note; history of when note had each status - userEnrollment (pd.DataFrame): one row per user specifying enrollment properties - - Returns: - Tuple[pd.DataFrame, pd.DataFrame]: - ratings: ratings filtered to only contain rows of interest - noteStatusHistory: noteStatusHistory filtered to only contain rows of interest - """ - userEnrollment = userEnrollment.rename(columns={c.participantIdKey: c.raterParticipantIdKey}) - userEnrollment = userEnrollment[userEnrollment[c.modelingGroupKey] == self._groupNumber] - ratings = ratings.merge(userEnrollment[[c.raterParticipantIdKey]].drop_duplicates()) - return ratings, noteStatusHistory - def _postprocess_output( self, noteScores: pd.DataFrame, @@ -275,27 +246,9 @@ def _postprocess_output( noteScores: filtered and updated note scoring output userScores: filtered and updated user scoring output """ - # Prune notes according to authorship filter. - if self._requireInternalAuthor: - noteScores = noteScores.merge( - userEnrollment[[c.participantIdKey, c.modelingGroupKey]].rename( - columns={c.participantIdKey: c.noteAuthorParticipantIdKey} - ), - how="left", - ) - noteScores = noteScores[noteScores[c.modelingGroupKey] == self._groupNumber] - noteScores = noteScores.drop(columns=c.modelingGroupKey) - # Identify notes with enough ratings from within the modeling group. - ratings = ratings.merge( - userEnrollment[[c.participantIdKey, c.modelingGroupKey]].rename( - columns={c.participantIdKey: c.raterParticipantIdKey} - ), - how="left", + noteScores, userScores = super()._postprocess_output( + noteScores, userScores, ratings, noteStatusHistory, userEnrollment ) - ratings["inGroup"] = ratings[c.modelingGroupKey] == self._groupNumber - ratios = ratings[[c.noteIdKey, "inGroup"]].groupby(c.noteIdKey).mean().reset_index() - notesAboveThreshold = ratios[ratios["inGroup"] >= self._groupThreshold][[c.noteIdKey]] - noteScores = noteScores.merge(notesAboveThreshold) # Note that even though ratings were restricted to the modeling group, users outside of # the modeling group may still have authored a note which was rated and may consequently # appear in the userScores. 
Accordingly, we drop any user which was outside of the @@ -306,9 +259,9 @@ def _postprocess_output( ), how="left", ) - userScores = userScores[userScores[c.modelingGroupKey] == self._groupNumber] + userScores = userScores[userScores[c.modelingGroupKey].isin(self._includedGroups)] userScores = userScores.drop(columns=c.modelingGroupKey) # Set the modelingGroupKey column in each output - noteScores[self._modelingGroupKey] = self._groupNumber - userScores[self._modelingGroupKey] = self._groupNumber + noteScores[self._modelingGroupKey] = self._groupId + userScores[self._modelingGroupKey] = self._groupId return noteScores, userScores diff --git a/sourcecode/scoring/mf_multi_group_scorer.py b/sourcecode/scoring/mf_multi_group_scorer.py new file mode 100644 index 00000000..736e6f9f --- /dev/null +++ b/sourcecode/scoring/mf_multi_group_scorer.py @@ -0,0 +1,54 @@ +from . import constants as c +from .mf_base_scorer import coalesce_columns +from .mf_group_scorer import MFGroupScorer + +import pandas as pd + + +def coalesce_multi_group_model_scored_notes(scoredNotes: pd.DataFrame) -> pd.DataFrame: + """Coalesce all multi group modeling columns across note scoring. + + Since each Scorer must have distinct output columns, we use coalescing to run + multiple instances of MFGroupScorer objects and then condense the results into + a single set of columns. This approach works because each note will be scored + by at most one MFGroupScorer instance. + """ + for col in [ + c.multiGroupNoteInterceptKey, + c.multiGroupNoteFactor1Key, + c.multiGroupRatingStatusKey, + c.modelingMultiGroupKey, + c.multiGroupInternalActiveRulesKey, + c.multiGroupNumFinalRoundRatingsKey, + ]: + scoredNotes = coalesce_columns(scoredNotes, col) + + return scoredNotes + + +def coalesce_multi_group_model_helpfulness_scores(helpfulnessScores: pd.DataFrame) -> pd.DataFrame: + """Coalesce all group modeling columns across user scoring. + + Since each Scorer must have distinct output columns, we use coalescing to run + multiple instances of MFGroupScorer objects and then condense the results into + a single set of columns. This approach works because each note will be scored + by at most one MFGroupScorer instance. 
+ """ + for col in [c.multiGroupRaterInterceptKey, c.multiGroupRaterFactor1Key, c.modelingMultiGroupKey]: + helpfulnessScores = coalesce_columns(helpfulnessScores, col) + return helpfulnessScores + + +class MFMultiGroupScorer(MFGroupScorer): + def _init_column_names(self): + self._groupNoteInterceptKey = f"{c.multiGroupNoteInterceptKey}_{self._groupId}" + self._groupNoteFactor1Key = f"{c.multiGroupNoteFactor1Key}_{self._groupId}" + self._groupRatingStatusKey = f"{c.multiGroupRatingStatusKey}_{self._groupId}" + self._groupInternalActiveRulesKey = f"{c.multiGroupInternalActiveRulesKey}_{self._groupId}" + self._groupNumFinalRoundRatingsKey = f"{c.multiGroupNumFinalRoundRatingsKey}_{self._groupId}" + self._groupRaterInterceptKey = f"{c.multiGroupRaterInterceptKey}_{self._groupId}" + self._groupRaterFactor1Key = f"{c.multiGroupRaterFactor1Key}_{self._groupId}" + self._modelingGroupKey = f"{c.modelingMultiGroupKey}_{self._groupId}" + + def get_name(self): + return f"MFMultiGroupScorer_{self._groupId}" diff --git a/sourcecode/scoring/mf_topic_scorer.py b/sourcecode/scoring/mf_topic_scorer.py index dd11a51b..22c04f8d 100644 --- a/sourcecode/scoring/mf_topic_scorer.py +++ b/sourcecode/scoring/mf_topic_scorer.py @@ -27,6 +27,7 @@ def coalesce_topic_models(scoredNotes: pd.DataFrame) -> pd.DataFrame: c.topicNoteConfidentKey, c.noteTopicKey, c.topicInternalActiveRulesKey, + c.topicNumFinalRoundRatingsKey, ]: scoredNotes = coalesce_columns(scoredNotes, col) @@ -76,8 +77,9 @@ def __init__( pseudoraters: if True, compute optional pseudorater confidence intervals """ super().__init__( - seed, - pseudoraters, + includedTopics={topicName}, + seed=seed, + pseudoraters=pseudoraters, useStableInitialization=False, saveIntermediateState=saveIntermediateState, threads=4, @@ -108,6 +110,7 @@ def __init__( self._topicNoteFactor1Key = f"{c.topicNoteFactor1Key}_{self._topicName}" self._topicRatingStatusKey = f"{c.topicRatingStatusKey}_{self._topicName}" self._topicInternalActiveRulesKey = f"{c.topicInternalActiveRulesKey}_{self._topicName}" + self._topicNumFinalRoundRatingsKey = f"{c.topicNumFinalRoundRatingsKey}_{self._topicName}" self._noteTopicKey = f"{c.noteTopicKey}_{self._topicName}" self._noteTopicConfidentKey = f"{c.topicNoteConfidentKey}_{self._topicName}" @@ -121,6 +124,8 @@ def _get_note_col_mapping(self) -> Dict[str, str]: c.internalNoteFactor1Key: self._topicNoteFactor1Key, c.internalRatingStatusKey: self._topicRatingStatusKey, c.internalActiveRulesKey: self._topicInternalActiveRulesKey, + c.numFinalRoundRatingsKey: self._topicNumFinalRoundRatingsKey, + c.lowDiligenceNoteInterceptKey: c.lowDiligenceLegacyNoteInterceptKey, } def get_scored_notes_cols(self) -> List[str]: @@ -133,6 +138,7 @@ def get_scored_notes_cols(self) -> List[str]: self._noteTopicKey, self._noteTopicConfidentKey, self._topicInternalActiveRulesKey, + self._topicNumFinalRoundRatingsKey, ] def get_helpfulness_scores_cols(self) -> List[str]: @@ -170,31 +176,6 @@ def _get_dropped_user_cols(self) -> List[str]: c.raterParticipantIdKey, ] - def _filter_input( - self, - noteTopics: pd.DataFrame, - ratings: pd.DataFrame, - noteStatusHistory: pd.DataFrame, - userEnrollment: pd.DataFrame, - ) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Prune the contents of ratings to only include ratings from notes on this topic. 
- - Args: - noteTopics: DF pairing notes and topics - ratings (pd.DataFrame): preprocessed ratings - noteStatusHistory (pd.DataFrame): one row per note; history of when note had each status - userEnrollment (pd.DataFrame): one row per user specifying enrollment properties - - Returns: - Tuple[pd.DataFrame, pd.DataFrame]: - ratings: ratings filtered to only contain rows of interest - noteStatusHistory: noteStatusHistory filtered to only contain rows of interest - """ - notes = noteTopics[noteTopics[c.noteTopicKey] == self._topicName][[c.noteIdKey]] - ratings = ratings.merge(notes) - noteStatusHistory = noteStatusHistory.merge(notes) - return ratings, noteStatusHistory - def _postprocess_output( self, noteScores: pd.DataFrame, @@ -250,6 +231,8 @@ def _postprocess_output( negFactorCounts = negFactorCounts[negFactorCounts["negRatingTotal"] > 4][[c.noteIdKey]] confidentNotes = posFactorCounts.merge(negFactorCounts) confidentNotes[self._noteTopicConfidentKey] = True - noteScores = noteScores.merge(confidentNotes, how="left") + noteScores = noteScores.merge( + confidentNotes, how="left", unsafeAllowed=[self._noteTopicConfidentKey, c.defaultIndexKey] + ) noteScores = noteScores.fillna({self._noteTopicConfidentKey: False}) return noteScores, userScores diff --git a/sourcecode/scoring/note_ratings.py b/sourcecode/scoring/note_ratings.py index 922e4aea..e5930c99 100644 --- a/sourcecode/scoring/note_ratings.py +++ b/sourcecode/scoring/note_ratings.py @@ -1,5 +1,6 @@ from datetime import datetime, timedelta, timezone -from typing import Callable, Optional +import logging +from typing import Callable, Dict, Optional from . import constants as c, incorrect_filter, scoring_rules, tag_filter from .scoring_rules import RuleID @@ -8,6 +9,10 @@ import pandas as pd +logger = logging.getLogger("birdwatch.note_ratings") +logger.setLevel(logging.INFO) + + # Threshold limiting the number of ratings which can be counted as "valid" for the purpose of # determining rating performance for notes which were created before noteStatusHistory was # introduced. Notice that this value coincides with the minimum number of ratings necessary to @@ -43,7 +48,7 @@ def is_crnh_diamond( def get_ratings_before_note_status_and_public_tsv( ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, - logging: bool = True, + log: bool = True, doTypeCheck: bool = True, ) -> pd.DataFrame: """Determine which ratings are made before note's most recent non-NMR status, @@ -54,7 +59,7 @@ def get_ratings_before_note_status_and_public_tsv( Args: ratings (pd.DataFrame) noteStatusHistory (pd.DataFrame) - logging (bool, optional). Defaults to True. + log (bool, optional). Defaults to True. doTypeCheck (bool): do asserts to check types. 
Returns: pd.DataFrame combinedRatingsBeforeStatus ratings that were created early enough to be valid ratings @@ -69,6 +74,7 @@ def get_ratings_before_note_status_and_public_tsv( on=c.noteIdKey, how="left", suffixes=("", right_suffix), + unsafeAllowed={c.createdAtMillisKey}, ) # Note that the column types for c.createdAtMillisKey and # c.timestampMillisOfNoteMostRecentNonNMRLabelKey are determined at runtime and cannot be statically @@ -94,9 +100,9 @@ def get_ratings_before_note_status_and_public_tsv( assert len(ratingsWithNoteLabelInfo) == len(ratings) mismatches = [ - (c, dtype, ratingsWithNoteLabelInfoTypes[c]) - for c, dtype in zip(ratingsWithNoteLabelInfo, ratingsWithNoteLabelInfo.dtypes) - if dtype != ratingsWithNoteLabelInfoTypes[c] + (col, dtype, ratingsWithNoteLabelInfoTypes[col]) + for col, dtype in zip(ratingsWithNoteLabelInfo, ratingsWithNoteLabelInfo.dtypes) + if ("participantid" not in col.lower()) and (dtype != ratingsWithNoteLabelInfoTypes[col]) ] assert not len(mismatches), f"Mismatch columns: {mismatches}" @@ -139,11 +145,11 @@ def get_ratings_before_note_status_and_public_tsv( combinedRatingsBeforeStatus = pd.concat([ratingsBeforeStatusNewNotes, first5RatingsOldNotes]) - if logging: - print( + if log: + logger.info( f"Total ratings: {np.invert(noteCreatedBeforeNoteStatusHistory).sum()} post-tombstones and {(noteCreatedBeforeNoteStatusHistory).sum()} pre-tombstones" ) - print( + logger.info( f"Total ratings created before statuses: {len(combinedRatingsBeforeStatus)}, including {len(ratingsBeforeStatusNewNotes)} post-tombstones and {len(first5RatingsOldNotes)} pre-tombstones." ) @@ -155,7 +161,7 @@ def get_ratings_with_scores( ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, scoredNotes: pd.DataFrame, - logging: bool = True, + log: bool = True, doTypeCheck: bool = True, ) -> pd.DataFrame: """ @@ -169,7 +175,7 @@ def get_ratings_with_scores( pd.DataFrame: binaryRatingsOnNotesWithStatusLabels Binary ratings with status labels """ ratingsBeforeNoteStatus = get_ratings_before_note_status_and_public_tsv( - ratings, noteStatusHistory, logging, doTypeCheck + ratings, noteStatusHistory, log, doTypeCheck ) ratingsWithScores = ratingsBeforeNoteStatus[ @@ -192,7 +198,7 @@ def get_valid_ratings( ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, scoredNotes: pd.DataFrame, - logging: bool = True, + log: bool = True, doTypeCheck: bool = True, ) -> pd.DataFrame: """Determine which ratings are "valid" (used to determine rater helpfulness score) @@ -203,13 +209,13 @@ def get_valid_ratings( ratings (pd.DataFrame) noteStatusHistory (pd.DataFrame) scoredNotes (pd.DataFrame) - logging (bool, optional): Defaults to True. + log (bool, optional): Defaults to True. doTypeCheck (bool): do asserts to check types. 
Returns: pd.DataFrame: binaryRatingsOnNotesWithStatusLabels CRH/CRNH notes group by helpfulness """ ratingsWithScores = get_ratings_with_scores( - ratings, noteStatusHistory, scoredNotes, logging, doTypeCheck + ratings, noteStatusHistory, scoredNotes, log, doTypeCheck ) ratingsWithScores[c.ratingCountKey] = 1 @@ -270,8 +276,8 @@ def get_valid_ratings( helpfulRatingOnCrhNote | notHelpfulRatingOnCrnhNote, c.ratingAgreesWithNoteStatusKey ] = True - if logging: - print(f"Total valid ratings: {len(binaryRatingsOnNotesWithStatusLabels)}") + if log: + logger.info(f"Total valid ratings: {len(binaryRatingsOnNotesWithStatusLabels)}") return binaryRatingsOnNotesWithStatusLabels @@ -326,6 +332,14 @@ def compute_note_stats(ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame) - ], on=c.noteIdKey, how="outer", + unsafeAllowed=set( + [ + c.numRatingsKey, + c.numRatingsLast28DaysKey, + ] + + c.helpfulTagsTSVOrder + + c.notHelpfulTagsTSVOrder + ), ) # Fill in nan values resulting from the outer merge with zero since these values were not @@ -359,8 +373,10 @@ def compute_scored_notes( crnhThresholdNoteFactorMultiplier: float, crnhThresholdNMIntercept: float, crnhThresholdUCBIntercept: float, - crhSuperThreshold: float, + crhSuperThreshold: Optional[float], inertiaDelta: float, + tagFilterThresholds: Optional[Dict[str, float]], + incorrectFilterThreshold: float, finalRound: bool = False, # TODO: We might want to consider inputing only the series here, instead of the whole callable is_crh_function: Callable[..., pd.Series] = is_crh, @@ -368,6 +384,7 @@ def compute_scored_notes( is_crnh_ucb_function: Callable[..., pd.Series] = is_crnh_ucb, lowDiligenceThreshold: float = 0.263, factorThreshold: float = 0.5, + firmRejectThreshold: Optional[float] = None, ) -> pd.DataFrame: """ Merges note status history, ratings, and model output. It annotes the data frame with @@ -414,11 +431,16 @@ def compute_scored_notes( # Merge with noteParams as necessary noteParamsColsToKeep = [c.noteIdKey, c.internalNoteInterceptKey, c.internalNoteFactor1Key] if finalRound: - noteParamsColsToKeep += [c.lowDiligenceInterceptKey] + noteParamsColsToKeep += [c.lowDiligenceNoteInterceptKey] for col in c.noteParameterUncertaintyTSVColumns: if col in noteParams.columns: noteParamsColsToKeep.append(col) - noteStats = noteStats.merge(noteParams[noteParamsColsToKeep], on=c.noteIdKey, how="left") + noteStats = noteStats.merge( + noteParams[noteParamsColsToKeep], + on=c.noteIdKey, + how="left", + unsafeAllowed={"ratingCount_all", "ratingCount_neg_fac", "ratingCount_pos_fac"}, + ) rules = [ scoring_rules.DefaultRule(RuleID.INITIAL_NMR, set(), c.needsMoreRatings), @@ -452,14 +474,38 @@ def compute_scored_notes( ), ] if finalRound: - # Compute tag aggregates only if they are required for tag filtering. - tagAggregates = tag_filter.get_note_tag_aggregates(ratings, noteParams, raterParams) - assert len(tagAggregates) == len(noteParams), "there should be one aggregate per scored note" - noteStats = tagAggregates.merge(noteStats, on=c.noteIdKey, how="outer") - incorrectAggregates = incorrect_filter.get_incorrect_aggregates( - ratings, noteParams, raterParams - ) - noteStats = noteStats.merge(incorrectAggregates, on=c.noteIdKey, how="outer") + with c.time_block("compute_scored_notes: compute tag aggregates"): + # Compute tag aggregates only if they are required for tag filtering. 
+ tagAggregates = tag_filter.get_note_tag_aggregates(ratings, noteParams, raterParams) + + # set pandas option to display all columns + pd.set_option("display.max_columns", None) + assert len(tagAggregates) == len(noteParams), f"""there should be one aggregate per scored note + len(noteParams) == {len(noteParams)}; len(np.unique(noteParams[c.noteIdKey])) == {len(np.unique(noteParams[c.noteIdKey]))} + len(tagAggregates) == {len(tagAggregates)}; len(np.unique(tagAggregates[c.noteIdKey])) == {len(np.unique(tagAggregates[c.noteIdKey]))} + + The first 30 notes that appear in noteParams but not in tagAggregates are: + {noteParams[~noteParams[c.noteIdKey].isin(tagAggregates[c.noteIdKey])].head(30)} + + The first 30 notes that appear in tagAggregates but not in noteParams are: + {tagAggregates[~tagAggregates[c.noteIdKey].isin(noteParams[c.noteIdKey])].head(30)} + """ + + noteStats = tagAggregates.merge(noteStats, on=c.noteIdKey, how="outer") + with c.time_block("compute_scored_notes: compute incorrect aggregates"): + incorrectAggregates = incorrect_filter.get_incorrect_aggregates_final_scoring( + ratings, noteParams, raterParams + ) + noteStats = noteStats.merge( + incorrectAggregates, + on=c.noteIdKey, + how="outer", + unsafeAllowed={ + c.notHelpfulIncorrectIntervalKey, + c.numVotersIntervalKey, + }, + ) + assert tagFilterThresholds is not None # Add tag filtering and sticky scoring logic. rules.extend( @@ -475,47 +521,64 @@ def compute_scored_notes( scoring_rules.FilterTagOutliers( RuleID.TAG_OUTLIER, {RuleID.GENERAL_CRH}, - c.needsMoreRatings, - crhSuperThreshold, - ), - scoring_rules.RuleFromFunction( - RuleID.ELEVATED_CRH, - {RuleID.INITIAL_NMR}, - c.currentlyRatedHelpful, - lambda noteStats: is_crh_function(noteStats, minRatingsNeeded, crhSuperThreshold), - onlyApplyToNotesThatSayTweetIsMisleading=True, - ), - scoring_rules.AddCRHInertia( - RuleID.ELEVATED_CRH_INERTIA, - {RuleID.TAG_OUTLIER}, - c.currentlyRatedHelpful, - crhSuperThreshold - inertiaDelta, - crhSuperThreshold, - minRatingsNeeded, + c.firmReject if firmRejectThreshold is not None else c.needsMoreRatings, + tagFilterThresholds=tagFilterThresholds, ), + ] + ) + if crhSuperThreshold is not None: + rules.extend( + [ + scoring_rules.RuleFromFunction( + RuleID.ELEVATED_CRH, + {RuleID.INITIAL_NMR}, + c.currentlyRatedHelpful, + lambda noteStats: is_crh_function(noteStats, minRatingsNeeded, crhSuperThreshold), + onlyApplyToNotesThatSayTweetIsMisleading=True, + ), + scoring_rules.AddCRHInertia( + RuleID.ELEVATED_CRH_INERTIA, + {RuleID.TAG_OUTLIER}, + c.currentlyRatedHelpful, + crhSuperThreshold - inertiaDelta, + crhSuperThreshold, + minRatingsNeeded, + ), + ] + ) + rules.extend( + [ scoring_rules.FilterIncorrect( RuleID.INCORRECT_OUTLIER, {RuleID.TAG_OUTLIER}, - c.needsMoreRatings, + c.firmReject if firmRejectThreshold is not None else c.needsMoreRatings, tagThreshold=2, voteThreshold=3, - weightedTotalVotes=2.5, - superThreshold=None, + weightedTotalVotes=incorrectFilterThreshold, ), scoring_rules.FilterLowDiligence( RuleID.LOW_DILIGENCE, {RuleID.INCORRECT_OUTLIER}, - c.needsMoreRatings, + c.firmReject if firmRejectThreshold is not None else c.needsMoreRatings, interceptThreshold=lowDiligenceThreshold, ), scoring_rules.FilterLargeFactor( RuleID.LARGE_FACTOR, {RuleID.LOW_DILIGENCE}, - c.needsMoreRatings, + c.firmReject if firmRejectThreshold is not None else c.needsMoreRatings, factorThreshold=factorThreshold, ), ] ) + if firmRejectThreshold is not None: + rules.append( + scoring_rules.RejectLowIntercept( + RuleID.LOW_INTERCEPT, + 
{RuleID.LARGE_FACTOR}, + c.firmReject, + firmRejectThreshold, + ) + ) scoredNotes = scoring_rules.apply_scoring_rules( noteStats, rules, c.internalRatingStatusKey, c.internalActiveRulesKey ) diff --git a/sourcecode/scoring/note_status_history.py b/sourcecode/scoring/note_status_history.py index 5226af4c..9a686a75 100644 --- a/sourcecode/scoring/note_status_history.py +++ b/sourcecode/scoring/note_status_history.py @@ -1,4 +1,6 @@ +import logging import time +from typing import Optional from . import constants as c from .scoring_rules import RuleID @@ -7,6 +9,9 @@ import pandas as pd +logger = logging.getLogger("birdwatch.note_status_history") +logger.setLevel(logging.INFO) + # Delay specifying when to lock note status, currently set to two weeks. _noteLockMillis = 14 * 24 * 60 * 60 * 1000 @@ -32,9 +37,10 @@ def merge_note_info(oldNoteStatusHistory: pd.DataFrame, notes: pd.DataFrame) -> # use outer so we don't drop deleted notes from "oldNoteStatusHistory" or new notes from "notes" how="outer", suffixes=("", noteSuffix), + unsafeAllowed={c.createdAtMillisKey, c.createdAtMillisKey + noteSuffix}, ) newNotes = pd.isna(newNoteStatusHistory[c.createdAtMillisKey]) - print(f"total notes added to noteStatusHistory: {sum(newNotes)}") + logger.info(f"total notes added to noteStatusHistory: {sum(newNotes)}") # Copy timestamp and authorship data over for new notes. newNoteStatusHistory.loc[newNotes, c.createdAtMillisKey] = newNoteStatusHistory.loc[ newNotes, c.createdAtMillisKey + noteSuffix @@ -55,6 +61,7 @@ def merge_note_info(oldNoteStatusHistory: pd.DataFrame, notes: pd.DataFrame) -> notes[[c.noteIdKey, c.createdAtMillisKey]], on=[c.noteIdKey, c.createdAtMillisKey], how="inner", + unsafeAllowed=c.createdAtMillisKey, ) ), "timestamps from notes and noteStatusHistory must match" assert len(notes) == len( @@ -86,6 +93,34 @@ def _update_single_note_status_history(mergedNote, currentTimeMillis, newScoredN Returns: row of pd.DataFrame """ + # This TS will be set by run_combine_scoring_outputs. + mergedNote[c.timestampMinuteOfFinalScoringOutput] = np.nan + + # TODO(jiansongc): remove after new column is in prod. + if c.timestampMillisOfFirstNmrDueToMinStableCrhTimeKey not in mergedNote: + mergedNote[c.timestampMillisOfFirstNmrDueToMinStableCrhTimeKey] = np.nan + + if not pd.isna(mergedNote[c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey]): + mergedNote[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] = mergedNote[ + c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey + ] + if pd.isna(mergedNote[c.timestampMillisOfFirstNmrDueToMinStableCrhTimeKey]): + mergedNote[c.timestampMillisOfFirstNmrDueToMinStableCrhTimeKey] = mergedNote[ + c.timestampMillisOfNmrDueToMinStableCrhTimeKey + ] + + if mergedNote[c.finalRatingStatusKey] != mergedNote[c.currentLabelKey]: + # Changed status vs. previous run: + mergedNote[c.timestampMillisOfMostRecentStatusChangeKey] = currentTimeMillis + else: + # No change in status vs. previous run + # If the note has not changed status (since the launch of this feature on 2024/07/02), + # then the timestamp of the most recent status change should be set to -1 by default. + if c.timestampMillisOfMostRecentStatusChangeKey not in mergedNote.index: + mergedNote[c.timestampMillisOfMostRecentStatusChangeKey] = -1 + elif pd.isna(mergedNote[c.timestampMillisOfMostRecentStatusChangeKey]): + mergedNote[c.timestampMillisOfMostRecentStatusChangeKey] = -1 + # Update the current status in accordance with this scoring run. 
assert not pd.isna(mergedNote[c.finalRatingStatusKey]) mergedNote[c.currentLabelKey] = mergedNote[c.finalRatingStatusKey] @@ -95,6 +130,8 @@ def _update_single_note_status_history(mergedNote, currentTimeMillis, newScoredN mergedNote[c.currentDecidedByKey] = mergedNote[c.decidedByKey] mergedNote[c.currentModelingGroupKey] = mergedNote[c.modelingGroupKey] mergedNote[c.timestampMillisOfNoteCurrentLabelKey] = currentTimeMillis + mergedNote[c.currentMultiGroupStatusKey] = mergedNote[c.multiGroupRatingStatusKey] + mergedNote[c.currentModelingMultiGroupKey] = mergedNote[c.modelingMultiGroupKey] # Lock notes which are (1) not already locked, (2) old enough to lock and (3) # were decided by logic which has global display impact. Criteria (3) guarantees @@ -120,6 +157,10 @@ def _update_single_note_status_history(mergedNote, currentTimeMillis, newScoredN mergedNote[c.lockedStatusKey] = mergedNote[c.finalRatingStatusKey] mergedNote[c.timestampMillisOfStatusLockKey] = currentTimeMillis + # Clear timestampMillisOfNmrDueToMinStableCrhTimeKey if the note is locked. + if pd.notna(mergedNote[c.lockedStatusKey]): + mergedNote[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] = -1 + if pd.isna(mergedNote[c.createdAtMillisKey + newScoredNotesSuffix]): # note used to be scored but isn't now; just retain old info return mergedNote @@ -155,7 +196,7 @@ def _update_single_note_status_history(mergedNote, currentTimeMillis, newScoredN return mergedNote -def _check_flips(mergedStatuses: pd.DataFrame, maxCrhChurn=0.25) -> None: +def check_flips(mergedStatuses: pd.DataFrame, noteSubset: c.NoteSubset) -> None: """Validate that number of CRH notes remains within an accepted bound. Assert fails and scoring exits with error if maximum allowable churn is exceeded. @@ -167,8 +208,28 @@ def _check_flips(mergedStatuses: pd.DataFrame, maxCrhChurn=0.25) -> None: Returns: None """ - # Prune to unlocked notes. - mergedStatuses = mergedStatuses[mergedStatuses[c.timestampMillisOfStatusLockKey].isna()] + if len(mergedStatuses) > c.minNumNotesForProdData: + # Prune notes to unlocked notes. + mergedStatuses = mergedStatuses[mergedStatuses[c.timestampMillisOfStatusLockKey].isna()] + # Prune to note subset + logger.info( + f"Checking Flip Rate for note subset: {noteSubset.description} (unlocked only), with max new CRH churn: {noteSubset.maxNewCrhChurnRate}, and max old CRH churn: {noteSubset.maxOldCrhChurnRate}" + ) + if noteSubset.noteSet is not None: + mergedStatuses = mergedStatuses[mergedStatuses[c.noteIdKey].isin(noteSubset.noteSet)] + + _check_flips(mergedStatuses, noteSubset.maxNewCrhChurnRate, noteSubset.maxOldCrhChurnRate) + + +def _check_flips( + mergedStatuses: pd.DataFrame, + maxNewCrhChurn: float, + maxOldCrhChurn: Optional[float] = None, + smoothingCount: int = 100, +) -> None: + if maxOldCrhChurn is None: + maxOldCrhChurn = maxNewCrhChurn + # Identify new and old CRH notes. oldCrhNotes = frozenset( mergedStatuses[mergedStatuses[c.currentLabelKey] == c.currentlyRatedHelpful][c.noteIdKey] @@ -176,35 +237,32 @@ def _check_flips(mergedStatuses: pd.DataFrame, maxCrhChurn=0.25) -> None: newCrhNotes = frozenset( mergedStatuses[mergedStatuses[c.finalRatingStatusKey] == c.currentlyRatedHelpful][c.noteIdKey] ) - # Validate that changes are within allowable bounds. 
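  # Illustrative arithmetic for the smoothed churn check that replaces the raw
  # assertions below (numbers are made up): with 400 previously-CRH notes, 120 newly
  # CRH notes, and the default smoothingCount of 100,
  #   raw ratio      = 120 / 400         = 0.30
  #   smoothed ratio = 120 / (400 + 100) = 0.24
  # The pseudo-count keeps small CRH populations from failing the check on churn that
  # is large proportionally but small in absolute terms.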
- assert ( - (len(newCrhNotes - oldCrhNotes) / len(oldCrhNotes)) < maxCrhChurn - ), f"Too many new CRH notes: newCrhNotes={len(newCrhNotes)}, oldCrhNotes={len(oldCrhNotes)}, delta={len(newCrhNotes - oldCrhNotes)}" - assert ( - (len(oldCrhNotes - newCrhNotes) / len(oldCrhNotes)) < maxCrhChurn - ), f"Too few new CRH notes: newCrhNotes={len(newCrhNotes)}, oldCrhNotes={len(oldCrhNotes)}, delta={len(oldCrhNotes - newCrhNotes)}" + if len(oldCrhNotes) > 0 and len(newCrhNotes) > 0: + # Validate that changes are within allowable bounds. + smoothedNewNoteRatio = (len(newCrhNotes - oldCrhNotes)) / (len(oldCrhNotes) + smoothingCount) + rawNewNoteRatio = (len(newCrhNotes - oldCrhNotes)) / len(oldCrhNotes) + logger.info( + f"Raw new note ratio: {rawNewNoteRatio}, smoothed new note ratio: {smoothedNewNoteRatio}. (newCrhNotes={len(newCrhNotes)}, oldCrhNotes={len(oldCrhNotes)}, delta={len(newCrhNotes - oldCrhNotes)}" + ) + smoothedOldNoteRatio = (len(oldCrhNotes - newCrhNotes)) / (len(oldCrhNotes) + smoothingCount) + rawOldNoteRatio = (len(oldCrhNotes - newCrhNotes)) / len(oldCrhNotes) + logger.info( + f"Raw old note ratio: {rawOldNoteRatio}, smoothed old note ratio: {smoothedOldNoteRatio}. (newCrhNotes={len(newCrhNotes)}, oldCrhNotes={len(oldCrhNotes)}, delta={len(oldCrhNotes - newCrhNotes)}" + ) + assert ( + smoothedNewNoteRatio < maxNewCrhChurn + ), f"Too many new CRH notes: newCrhNotes={len(newCrhNotes)}, oldCrhNotes={len(oldCrhNotes)}, delta={len(newCrhNotes - oldCrhNotes)}" -def update_note_status_history( - oldNoteStatusHistory: pd.DataFrame, - scoredNotes: pd.DataFrame, -) -> pd.DataFrame: - """Generate new noteStatusHistory by merging in new note labels. + assert ( + smoothedOldNoteRatio < maxOldCrhChurn + ), f"Too many notes lost CRH status: oldCrhNotes={len(oldCrhNotes)}, newCrhNotes={len(newCrhNotes)}, delta={len(oldCrhNotes - newCrhNotes)}" - Args: - oldNoteStatusHistory (pd.DataFrame) - scoredNotes (pd.DataFrame) - Returns: - pd.DataFrame: noteStatusHistory - """ - if c.useCurrentTimeInsteadOfEpochMillisForNoteStatusHistory: - # When running in prod, we use the latest time possible, so as to include as many valid ratings - # as possible, and be closest to the time the new note statuses are user-visible. - currentTimeMillis = 1000 * time.time() - else: - # When running in test, we use the overridable epochMillis constant. 
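# In this refactor the old single entry point is split into three steps. A sketch of
# how a caller would presumably wire them together (the call sites are not part of
# this diff, and the subset iteration is an assumption):
#
#   mergedStatuses = merge_old_and_new_note_statuses(oldNoteStatusHistory, scoredNotes)
#   for noteSubset in noteSubsets:
#     check_flips(mergedStatuses, noteSubset)            # churn guard per note subset
#   newNoteStatusHistory = update_note_status_history(mergedStatuses)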
- currentTimeMillis = c.epochMillis +def merge_old_and_new_note_statuses( + oldNoteStatusHistory: pd.DataFrame, + scoredNotes: pd.DataFrame, +): newScoredNotesSuffix = "_sn" mergedStatuses = oldNoteStatusHistory.merge( scoredNotes[ @@ -217,6 +275,9 @@ def update_note_status_history( c.expansionRatingStatusKey, c.groupRatingStatusKey, c.modelingGroupKey, + c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey, + c.multiGroupRatingStatusKey, + c.modelingMultiGroupKey, ] ].rename( { @@ -230,8 +291,21 @@ def update_note_status_history( assert len(mergedStatuses) == len( oldNoteStatusHistory ), "scoredNotes and oldNoteStatusHistory should both contain all notes" - if len(mergedStatuses) > c.minNumNotesForProdData: - _check_flips(mergedStatuses) + return mergedStatuses + + +def update_note_status_history( + mergedStatuses: pd.DataFrame, + newScoredNotesSuffix: str = "_sn", +) -> pd.DataFrame: + """Generate new noteStatusHistory by merging in new note labels.""" + if c.useCurrentTimeInsteadOfEpochMillisForNoteStatusHistory: + # When running in prod, we use the latest time possible, so as to include as many valid ratings + # as possible, and be closest to the time the new note statuses are user-visible. + currentTimeMillis = 1000 * time.time() + else: + # When running in test, we use the overridable epochMillis constant. + currentTimeMillis = c.epochMillis def apply_update(mergedNote): return _update_single_note_status_history( diff --git a/sourcecode/scoring/pandas_utils.py b/sourcecode/scoring/pandas_utils.py new file mode 100644 index 00000000..91505ab2 --- /dev/null +++ b/sourcecode/scoring/pandas_utils.py @@ -0,0 +1,670 @@ +"""This module patches Pandas to alert or fail on unexpected dtype conversions. + +The module corrently supports the merge, join and concat operations as these functions +can generate derived dataframes with type conversions. The patch can be configured to +either log to stderr or assert False when an unexpected type conversion is detected. + +This module should support type-related work in the scorer, including: +* Setting all input datatypes to the appropriate np (non-nullable) or pd (nullable) datatype + for the associated input. For example, noteIds should be np.int64, timestamp of first + status should be pd.Int64Dtype, etc. +* Enforcing type expectations on outputs. For example, validating that the participantId + is an int64 and has not been converted to float. +* Fixing unexpected type conversion errors by specifying default values for rows that are + lacking columns during a merge, join or concat. For example, if we generate numRatings + and then join with noteStatusHistory, we should be able to pass fillna={"numRatings": 0} + to "merge" so that the resulting column should still have type np.int64 where missing + values have been filled with 0 (as opposed to cast to a float with missing values set to + np.NaN). +* Add an "allow_unsafe" keyword argument to merge, join and concat that overrides "fail" + and instead logs to stderr. This will allow us to default all current and new code to + enforced safe behavior except for callsites that haven't been fixed yet. +""" + +from collections import Counter +from dataclasses import dataclass +from enum import Enum +import re +import sys +from threading import Lock +import traceback +from typing import Any, Callable, Dict, List, Optional, Set, Tuple + +from . 
import constants as c + +import numpy as np +import pandas as pd + + +def keep_columns(df: pd.DataFrame, cols: List[str]): + cols = [col for col in cols if col in df] + return df[cols] + + +def get_df_info( + df: pd.DataFrame, name: Optional[str] = None, deep: bool = False, counter: bool = False +) -> str: + """Log dtype and RAM usage stats for each input DataFrame.""" + stats = ( + df.dtypes.to_frame().reset_index(drop=False).rename(columns={"index": "column", 0: "dtype"}) + ).merge( + # deep=True shows memory usage for the entire contained object (e.g. if the type + # of a column is "object", then deep=True shows the size of the objects instead + # of the size of the pointers. + df.memory_usage(index=True, deep=deep) + .to_frame() + .reset_index(drop=False) + .rename(columns={"index": "column", 0: "RAM"}) + ) + ramBytes = stats["RAM"].sum() + if name is not None: + lines = [f"""{name} total RAM: {ramBytes} bytes ({ramBytes * 1e-9:.3f} GB)"""] + else: + lines = [f"""total RAM: {ramBytes} bytes ({ramBytes * 1e-9:.3f} GB)"""] + lines.extend(str(stats).split("\n")) + if counter: + for col, dtype in zip(stats["column"], stats["dtype"]): + if dtype != object: + continue + lines.append(f"{col}: {Counter(type(obj) for obj in df[col])}") + return "\n".join(lines) + + +class TypeErrorCounter(object): + def __init__(self): + self._callCounts: Dict[Tuple[str, str], int] = dict() + self._typeErrors: Dict[Tuple[str, str], Counter[str]] = dict() + self._lock = Lock() + + def log_errors(self, method: str, callsite: str, errors: List[str]) -> None: + key = (method, callsite) + with self._lock: + if key not in self._callCounts: + self._callCounts[key] = 0 + self._callCounts[key] += 1 + if key not in self._typeErrors: + self._typeErrors[key] = Counter() + for error in errors: + self._typeErrors[key][error] += 1 + + def get_summary(self): + lines = [] + keys = [ + (method, -1 * count, callsite) for ((method, callsite), count) in self._callCounts.items() + ] + for method, count, callsite in sorted(keys): + lines.append(f"{method}: {-1 * count} BAD CALLS AT: {callsite.rstrip()}") + for error, errorCount in self._typeErrors[(method, callsite)].items(): + lines.append(f" {errorCount:3d}x {error}") + lines.append("") + return "\n".join(lines) + + +class LogLevel(Enum): + # Raise an error if the expecatation is violated + FATAL = 1 + # Log to stderr when the expectation is violated + ERROR = 2 + # Log to stderr any time the column is observed + INFO = 3 + + +@dataclass +class TypeExpectation: + dtype: type + logLevel: LogLevel + + +class PandasPatcher(object): + def __init__(self, fail: bool, typeOverrides: Dict[str, TypeExpectation] = dict()): + """Initialize a PandasPatcher with particular failure and type expectations. + + Args: + fail: Whether to raise errors or log to stderr when expectations are violated. + expectations: Type expecatations for select columns. 
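        Example (hypothetical column name, chosen only for illustration):

          PandasPatcher(
              fail=False,
              typeOverrides={"createdAtMillis": TypeExpectation(np.int64, LogLevel.ERROR)},
          )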
+ """ + self._fail = fail + self._counter = TypeErrorCounter() + self._origConcat = pd.concat + self._origJoin = pd.DataFrame.join + self._origMerge = pd.DataFrame.merge + self._origApply = pd.DataFrame.apply + self._origInit = pd.DataFrame.__init__ + self._origGetItem = pd.DataFrame.__getitem__ + self._origSetItem = pd.DataFrame.__setitem__ + self._origLocGetItem = pd.core.indexing._LocationIndexer.__getitem__ + self._origLocSetItem = pd.core.indexing._LocationIndexer.__setitem__ + self._expectations = { + c.noteIdKey: TypeExpectation(np.int64, LogLevel.ERROR), + } + for column, expectation in typeOverrides.items(): + self._expectations[column] = expectation + + def get_summary(self) -> str: + return f"\nTYPE WARNING SUMMARY\n{self._counter.get_summary()}" + + def _log_errors(self, method: str, callsite: str, lines: List[str]) -> None: + if not lines: + return + self._counter.log_errors(method, callsite, lines) + errorLines = "\n".join([f" PandasTypeError: {l}" for l in lines]) + msg = f"\n{method} ERROR(S) AT: {callsite}\n{errorLines}\n" + print(msg, file=sys.stderr) + + def _get_check(self, lines: List[str], kwargs: Dict) -> Callable: + """Return a function which will either assert a condition or append to a list of errors. + + Note that this function does not actually log to stderr, but rather appends to a list so + that all + """ + unsafeAllowed = set() + if "unsafeAllowed" in kwargs: + unsafeAllowedArg = kwargs["unsafeAllowed"] + if isinstance(unsafeAllowedArg, str): + unsafeAllowed = {unsafeAllowedArg} + elif isinstance(unsafeAllowedArg, List): + unsafeAllowed = set(unsafeAllowedArg) + else: + assert isinstance(unsafeAllowedArg, Set) + unsafeAllowed = unsafeAllowedArg + del kwargs["unsafeAllowed"] + + def _check(columns: Any, condition: bool, msg: str): + if isinstance(columns, str): + failDisabled = columns in unsafeAllowed + elif isinstance(columns, List): + failDisabled = all(col in unsafeAllowed for col in columns) + else: + # Note there are multiple circumstances where the type of Columns may not be a str + # or List[str], including when we are concatenating a Series (column name will be + # set to None), when there are mulit-level column names (column name will be a tuple) + # or when Pandas has set column names to a RangeIndex. + failDisabled = False + if self._fail and not failDisabled: + assert condition, msg + elif not condition: + if failDisabled: + lines.append(f"{msg} (allowed)") + else: + lines.append(f"{msg} (UNALLOWED)") + + return _check + + def _get_callsite(self) -> str: + """Return the file, function, line numer and pandas API call on a single line.""" + for line in traceback.format_stack()[::-1]: + path = line.split(",")[0] + if "/pandas_utils.py" in path: + continue + if "/pandas/" in path: + continue + break + # Handle paths resulting from bazel invocation + match = re.match(r'^ File ".*?/site-packages(/.*?)", (.*?), (.*?)\n (.*)\n$', line) + if match: + return f"{match.group(1)}, {match.group(3)}, at {match.group(2)}: {match.group(4)}" + # Handle paths fresulting from pytest invocation + match = re.match(r'^ File ".*?/src/(test|main)/python(/.*?)", (.*?), (.*?)\n (.*)\n$', line) + if match: + return f"{match.group(2)}, {match.group(4)}, at {match.group(3)}: {match.group(5)}" + # Handle other paths (e.g. 
notebook, public code) + match = re.match(r'^ File "(.*?)", (.*?), (.*?)\n (.*)\n$', line) + if match: + return f"{match.group(1)}, {match.group(3)}, at {match.group(2)}: {match.group(4)}" + else: + stack = "\n\n".join(traceback.format_stack()[::-1]) + print(f"parsing error:\n{stack}", file=sys.stderr) + return "parsing error. callsite unknown." + + def _check_dtype(self, dtype: Any, expected: type) -> bool: + """Return True IFF dtype corresponds to expected. + + Note that for non-nullable columns, dtype may equal type (e.g. np.int64), but for nullable + columns the column type is actually an instance of a pandas dtype (e.g. pd.Int64Dtype) + """ + assert expected != object, "expectation must be more specific than object" + return dtype == expected or isinstance(dtype, expected) + + def _check_name_and_type(self, name: str, dtype: Any) -> List[str]: + """Returns a list of type mismatches if any are found, or raises an error.""" + if name not in self._expectations: + return [] + typeExpectation = self._expectations[name] + msg = f"Type expectation mismatch on {name}: found={dtype} expected={typeExpectation.dtype.__name__}" + match = self._check_dtype(dtype, typeExpectation.dtype) + if typeExpectation.logLevel == LogLevel.INFO: + return ( + [msg] + if not match + else [ + f"Type expectation match on {name}: found={dtype} expected={typeExpectation.dtype.__name__}" + ] + ) + elif typeExpectation.logLevel == LogLevel.ERROR or not self._fail: + return [msg] if not match else [] + else: + assert typeExpectation.logLevel == LogLevel.FATAL + assert self._fail + assert match, msg + return [] + + def _validate_series(self, series: pd.Series) -> List[str]: + assert isinstance(series, pd.Series), f"unexpected type: {type(series)}" + return self._check_name_and_type(series.name, series.dtype) + + def _validate_dataframe(self, df: pd.DataFrame) -> List[str]: + """Returns a list of type mismatches if any are found, or raises an error.""" + assert isinstance(df, pd.DataFrame), f"unexpected type: {type(df)}" + lines = [] + # Check index types + if type(df.index) == pd.MultiIndex: + for name, dtype in df.index.dtypes.to_dict().items(): + lines.extend(self._check_name_and_type(name, dtype)) + elif type(df.index) == pd.RangeIndex or df.index.name is None: + # Index is uninteresting - none was specified by the caller. + pass + else: + lines.extend(self._check_name_and_type(df.index.name, df.index.dtype)) + # Check column types + for name, dtype in df.dtypes.to_dict().items(): + lines.extend(self._check_name_and_type(name, dtype)) + return lines + + def safe_init(self) -> Callable: + """Return a modified __init__ function that checks type expectations.""" + + def _safe_init(*args, **kwargs): + """Wrapper around pd.concat + + Args: + args: non-keyword arguments to pass through to merge. + kwargs: keyword arguments to pass through to merge. + """ + df = args[0] + assert isinstance(df, pd.DataFrame), f"unexpected type: {type(df)}" + retVal = self._origInit(*args, **kwargs) + assert retVal is None + lines = self._validate_dataframe(df) + self._log_errors("INIT", self._get_callsite(), lines) + return retVal + + return _safe_init + + def safe_concat(self) -> Callable: + """Return a modified concat function that checks type stability.""" + + def _safe_concat(*args, **kwargs): + """Wrapper around pd.concat + + Args: + args: non-keyword arguments to pass through to merge. + kwargs: keyword arguments to pass through to merge. 
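        Example of the kind of silent conversion this wrapper surfaces (illustrative
        values, not from production data):

          pd.concat([pd.Series([1], dtype=np.int64), pd.Series([1.5])]).dtype  # float64

        The mixed input dtypes are flagged as "More than 1 unique Series type".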
+ """ + lines = [] + check = self._get_check(lines, kwargs) + # Validate that all objects being concatenated are either Series or DataFrames + objs = args[0] + assert type(objs) == list, f"expected first argument to be a list: type={type(objs)}" + assert ( + all(type(obj) == pd.Series for obj in objs) + or all(type(obj) == pd.DataFrame for obj in objs) + ), f"Expected concat args to be either pd.Series or pd.DataFrame: {[type(obj) for obj in objs]}" + if type(objs[0]) == pd.Series: + if "axis" in kwargs and kwargs["axis"] == 1: + # Since the call is concatenating Series as columns in a DataFrame, validate that the sequence + # of Series dtypes matches the sequence of column dtypes in the dataframe. + result = self._origConcat(*args, **kwargs) + objDtypes = [obj.dtype for obj in objs] + assert len(objDtypes) == len( + result.dtypes + ), f"dtype length mismatch: {len(objDtypes)} vs {len(result.dtypes)}" + for col, seriesType, colType in zip(result.columns, objDtypes, result.dtypes): + check( + col, + seriesType == colType, + f"Series concat on {col}: {seriesType} vs {colType}", + ) + else: + # If Series, validate that all series were same type and return + seriesTypes = set(obj.dtype for obj in objs) + check(None, len(seriesTypes) == 1, f"More than 1 unique Series type: {seriesTypes}") + result = self._origConcat(*args, **kwargs) + else: + # If DataFrame, validate that all input columns with matching names have the same type + # and build expectation for output column types + assert type(objs[0]) == pd.DataFrame + # Validate all inputs + for dfArg in objs: + lines.extend(self._validate_dataframe(dfArg)) + colTypes: Dict[str, List[type]] = dict() + for df in objs: + for col, dtype in df.reset_index(drop=False).dtypes.items(): + if col not in colTypes: + colTypes[col] = [] + colTypes[col].append(dtype) + # Perform concatenation and validate that there weren't any type changes + result = self._origConcat(*args, **kwargs) + for col, outputType in result.reset_index(drop=False).dtypes.items(): + check( + col, + all(inputType == outputType for inputType in colTypes[col]), + f"DataFrame concat on {col}: output={outputType} inputs={colTypes[col]}", + ) + if isinstance(result, pd.DataFrame): + lines.extend(self._validate_dataframe(result)) + elif isinstance(result, pd.Series): + lines.extend(self._validate_series(result)) + self._log_errors("CONCAT", self._get_callsite(), lines) + return result + + return _safe_concat + + def safe_apply(self) -> Callable: + """Return a modified apply function that checks type stability.""" + + def _safe_apply(*args, **kwargs): + """Wrapper around pd.DataFrame.apply + + Args: + args: non-keyword arguments to pass through to merge. + kwargs: keyword arguments to pass through to merge. + """ + # TODO: Flesh this out with additional expectatoins around input and output types + result = self._origApply(*args, **kwargs) + if isinstance(result, pd.DataFrame): + self._log_errors("APPLY", self._get_callsite(), self._validate_dataframe(result)) + elif isinstance(result, pd.Series): + self._log_errors("APPLY", self._get_callsite(), self._validate_series(result)) + return result + + return _safe_apply + + def safe_merge(self) -> Callable: + """Return a modified merge function that checks type stability.""" + + def _safe_merge(*args, **kwargs): + """Wrapper around pd.DataFrame.merge. + + Args: + args: non-keyword arguments to pass through to merge. + kwargs: keyword arguments to pass through to merge. 
+ """ + lines = [] + check = self._get_check(lines, kwargs) + leftFrame = args[0] + rightFrame = args[1] + # Validate that argument types are as expected + assert type(leftFrame) is pd.DataFrame + assert type(rightFrame) is pd.DataFrame + lines.extend(self._validate_dataframe(leftFrame)) + lines.extend(self._validate_dataframe(rightFrame)) + # Store dtypes and validate that any common columns have the same type + leftDtypes = dict(leftFrame.reset_index(drop=False).dtypes) + rightDtypes = dict(rightFrame.reset_index(drop=False).dtypes) + for col in set(leftDtypes) & set(rightDtypes): + check( + col, + leftDtypes[col] == rightDtypes[col], + f"Input mismatch on {col}: left={leftDtypes[col]} vs right={rightDtypes[col]}", + ) + # Identify the columns we are merging on, if left_on and right_on are unset + if "on" in kwargs and type(kwargs["on"]) == str: + onCols = set([kwargs["on"]]) + elif "on" in kwargs and type(kwargs["on"]) == list: + onCols = set(kwargs["on"]) + elif "left_on" in kwargs: + assert "on" not in kwargs, "not expecting both on and left_on" + assert "right_on" in kwargs, "expecting both left_on and right_on to be set" + onCols = set() + else: + assert "on" not in kwargs, f"""unexpected type for on: {type(kwargs["on"])}""" + onCols = set(leftFrame.columns) & set(rightFrame.columns) + # Validate that merge columns have matching types + if "left_on" in kwargs: + assert "right_on" in kwargs + left_on = kwargs["left_on"] + right_on = kwargs["right_on"] + check( + [left_on, right_on], + leftDtypes[left_on] == rightDtypes[right_on], + f"Merge key mismatch on type({left_on})={leftDtypes[left_on]} vs type({right_on})={rightDtypes[right_on]}", + ) + else: + assert len(onCols), "expected onCols to be defined since left_on was not" + assert "right_on" not in kwargs, "did not expect onCols and right_on" + for col in onCols: + check( + col, + leftDtypes[col] == rightDtypes[col], + f"Merge key mismatch on {col}: left={leftDtypes[col]} vs right={rightDtypes[col]}", + ) + # Compute expected column types + leftSuffix, rightSuffix = kwargs.get("suffixes", ("_x", "_y")) + commonCols = set(leftFrame.columns) & set(rightFrame.columns) + expectedColTypes = dict() + for col in set(leftFrame.columns) | set(rightFrame.columns): + if col in onCols: + # Note that we check above whether leftDtypes[col] == rightDtypes[col] and either raise an + # error or log as appropriate if there is a mismatch. 
+ if leftDtypes[col] == rightDtypes[col]: + expectedColTypes[col] = leftDtypes[col] + else: + # Set expectation to None since we don't know what will happen, but do want to log an + # error later + expectedColTypes[col] = None + elif col in commonCols: + expectedColTypes[f"{col}{leftSuffix}"] = leftDtypes[col] + expectedColTypes[f"{col}{rightSuffix}"] = rightDtypes[col] + elif col in leftDtypes: + assert col not in rightDtypes + expectedColTypes[col] = leftDtypes[col] + else: + expectedColTypes[col] = rightDtypes[col] + # Perform merge and validate results + result = self._origMerge(*args, **kwargs) + resultDtypes = dict(result.dtypes) + for col in resultDtypes: + check( + col, + resultDtypes[col] == expectedColTypes[col], + f"Output mismatch on {col}: result={resultDtypes[col]} expected={expectedColTypes[col]}", + ) + lines.extend(self._validate_dataframe(result)) + self._log_errors("MERGE", self._get_callsite(), lines) + return result + + return _safe_merge + + def safe_join(self) -> Callable: + """Return a modified merge function that checks type stability.""" + + def _safe_join(*args, **kwargs): + """Wrapper around pd.DataFrame.merge. + + Args: + args: non-keyword arguments to pass through to merge. + kwargs: keyword arguments to pass through to merge. + """ + lines = [] + check = self._get_check(lines, kwargs) + leftFrame = args[0] + rightFrame = args[1] + # Validate arguments are as expected + assert type(leftFrame) is pd.DataFrame + assert type(rightFrame) is pd.DataFrame + lines.extend(self._validate_dataframe(leftFrame)) + lines.extend(self._validate_dataframe(rightFrame)) + assert len(set(kwargs) - {"lsuffix", "rsuffix", "how"}) == 0, f"unexpected kwargs: {kwargs}" + # Validate the assumption that columns used as the join key in the index have the same type. + # This is analogous to validating that onCols match and have the same types in _safe_merge. 
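      # For instance (hypothetical frames), joining a frame indexed by an int64 noteId
      # against one indexed by a float64 noteId is flagged as a "Join index mismatch"
      # by the check constructed below.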
+ if len(leftFrame.index.names) == 1 and len(rightFrame.index.names) == 1: + match = leftFrame.index.dtype == rightFrame.index.dtype + elif len(leftFrame.index.names) == 1 and len(rightFrame.index.names) > 1: + indexTypes = dict(rightFrame.index.dtypes) + name = leftFrame.index.names[0] + assert name in indexTypes, f"{name} not found in {indexTypes}" + match = indexTypes[name] == leftFrame.index.dtype + elif len(leftFrame.index.names) > 1 and len(rightFrame.index.names) == 1: + indexTypes = dict(leftFrame.index.dtypes) + name = rightFrame.index.names[0] + assert name in indexTypes, f"{name} not found in {indexTypes}" + match = indexTypes[name] == rightFrame.index.dtype + else: + assert ( + len(leftFrame.index.names) > 1 + ), f"unexpected left: {type(leftFrame.index)}, {leftFrame.index}" + assert ( + len(rightFrame.index.names) > 1 + ), f"unexpected right: {type(rightFrame.index)}, {rightFrame.index}" + leftIndexTypes = dict(leftFrame.index.dtypes) + rightIndexTypes = dict(rightFrame.index.dtypes) + match = True + for col in set(leftIndexTypes) & set(rightIndexTypes): + match = match & (leftIndexTypes[col] == rightIndexTypes[col]) + check( + list(set(leftFrame.index.names) | set(rightFrame.index.names)), + match, + "Join index mismatch:\nleft:\n{left}\nvs\nright:\n{right}".format( + left=leftFrame.index.dtype if len(leftFrame.index.names) == 1 else leftFrame.index.dtypes, + right=rightFrame.index.dtype + if len(rightFrame.index.names) == 1 + else rightFrame.index.dtypes, + ), + ) + # Validate that input columns with the same name have the same types + leftDtypes = dict(leftFrame.dtypes) + rightDtypes = dict(rightFrame.dtypes) + for col in set(leftDtypes) & set(rightDtypes): + check( + col, + leftDtypes[col] == rightDtypes[col], + f"Input mismatch on {col}: left={leftDtypes[col]} vs right={rightDtypes[col]}", + ) + # Validate that none of the columns in an index have the same name as a non-index column + # in the opposite dataframe + assert ( + len(set(leftFrame.index.names) & set(rightFrame.columns)) == 0 + ), f"left index: {set(leftFrame.index.names)}; right columns {set(rightFrame.columns)}" + assert ( + len(set(rightFrame.index.names) & set(leftFrame.columns)) == 0 + ), f"right index: {set(rightFrame.index.names)}; left columns {set(leftFrame.columns)}" + # Compute expected types for output columns + commonCols = set(leftFrame.columns) & set(rightFrame.columns) + expectedColTypes = dict() + leftSuffix = kwargs.get("lsuffix", "") + rightSuffix = kwargs.get("rsuffix", "") + for col in set(leftFrame.columns) | set(rightFrame.columns): + if col in commonCols: + expectedColTypes[f"{col}{leftSuffix}"] = leftDtypes[col] + expectedColTypes[f"{col}{rightSuffix}"] = rightDtypes[col] + elif col in leftDtypes: + assert col not in rightDtypes + expectedColTypes[col] = leftDtypes[col] + else: + expectedColTypes[col] = rightDtypes[col] + # Compute expected types for index columns + leftIndexCols = set(leftFrame.index.names) + rightIndexCols = set(rightFrame.index.names) + if len(leftIndexCols) > 1: + leftDtypes = dict(leftFrame.index.dtypes) + else: + leftDtypes = {leftFrame.index.name: rightFrame.index.dtype} + if len(rightIndexCols) > 1: + rightDtypes = dict(rightFrame.index.dtypes) + else: + rightDtypes = {rightFrame.index.name: rightFrame.index.dtype} + for col in leftIndexCols & rightIndexCols: + # For columns in both indices, type should not change if input types agree. If input types + # disagree, then we have no expectation. 
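        # e.g. (hypothetical): if both frames are indexed by an int64 noteId, the joined
        # index is expected to remain int64; if the two index dtypes disagree, the
        # expectation is left as None and the later output check reports the resulting
        # dtype as a mismatch against that missing expectation.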
+ if leftDtypes[col] == rightDtypes[col]: + expectedColTypes[col] = leftDtypes[col] + else: + expectedColTypes[col] = None + for col in (leftIndexCols | rightIndexCols) - (leftIndexCols & rightIndexCols): + # For columns in exactly one index, the expected output type should match the input column type + # and the column name should not change because we have validated that the column does not + # appear in the other dataframe + if col in leftDtypes: + assert col not in rightDtypes, f"unexpected column: {col}" + expectedColTypes[col] = leftDtypes[col] + else: + expectedColTypes[col] = rightDtypes[col] + # Perform join and validate results. Note that we already validated that the indices had the + # same columns and types, and that the "on" argument is unset, so now we only need to check + # the non-index columns. + result = self._origJoin(*args, **kwargs) + # Note that we must reset index to force any NaNs in the index to emerge as float types. + # See example below. + # left = pd.DataFrame({"idx0": [1, 2], "idx1": [11, 12], "val1": [4, 5]}).set_index(["idx0", "idx1"]) + # right = pd.DataFrame({"idx0": [1, 2, 3], "idx2": [21, 22, 23], "val2": [7, 8, 9]}).set_index(["idx0", "idx2"]) + # print(dict(left.join(right, how="outer").index.dtypes)) + # print(dict(left.join(right, how="outer").reset_index(drop=False).dtypes)) + # $> {'idx0': dtype('int64'), 'idx1': dtype('int64'), 'idx2': dtype('int64')} + # $> {'idx0': dtype('int64'), 'idx1': dtype('float64'), 'idx2': dtype('int64'), 'val1': dtype('float64'), 'val2': dtype('int64')} + resultDtypes = dict(result.reset_index(drop=False).dtypes) + # Add default type for index + if "index" not in expectedColTypes: + expectedColTypes["index"] = np.int64 + for col, dtype in resultDtypes.items(): + if len(col) == 2 and col[1] == "": + col = col[0] + check( + col, + dtype == expectedColTypes[col], + f"Output mismatch on {col}: result={dtype} expected={expectedColTypes[col]}", + ) + lines.extend(self._validate_dataframe(result)) + self._log_errors("JOIN", self._get_callsite(), lines) + return result + + return _safe_join + + +# TODO: restore original functionality before return +# TODO: make enforce_types an explicit arguemnt so this is less error prone +def patch_pandas(main: Callable) -> Callable: + """Return a decorator for wrapping main with pandas patching and logging + + Args: + main: "main" function for program binary + """ + + def _inner(*args, **kwargs) -> Any: + """Determine patching behavior, apply patch and add logging.""" + print("Patching pandas") + if "args" in kwargs: + # Handle birdwatch/scoring/src/main/python/public/scoring/runner.py, which expects + # args as a keyword argument and not as a positional argument. 
+ assert len(args) == 0, f"positional arguments not expected, but found {len(args)}" + clArgs = kwargs["args"] + else: + # Handle the following, which expect args as the second positional argument: + # birdwatch/scoring/src/main/python/run_prescoring.py + # birdwatch/scoring/src/main/python/run_final_scoring.py + # birdwatch/scoring/src/main/python/run_contributor_scoring.py + # birdwatch/scoring/src/main/python/run.py + assert len(args) == 1, f"unexpected 1 positional args, but found {len(args)}" + assert len(kwargs) == 0, f"expected kwargs to be empty, but found {len(kwargs)}" + clArgs = args[0] + # Apply patches, configured based on whether types should be enforced or logged + patcher = PandasPatcher(clArgs.enforce_types) + pd.concat = patcher.safe_concat() + # Note that this will work when calling df1.merge(df2) because the first argument + # to "merge" is df1 (i.e. self). + pd.DataFrame.merge = patcher.safe_merge() + pd.DataFrame.join = patcher.safe_join() + pd.DataFrame.apply = patcher.safe_apply() + pd.DataFrame.__init__ = patcher.safe_init() + # Run main + retVal = main(*args, **kwargs) + # Log type error summary + if hasattr(clArgs, "parallel") and not clArgs.parallel: + print(patcher.get_summary(), file=sys.stderr) + else: + # Don't show type summary because counters will be inaccurate due to scorers running + # in their own process. + print("Type summary omitted when running in parallel.", file=sys.stderr) + # Return result of main + return retVal + + return _inner diff --git a/sourcecode/scoring/post_selection_similarity.py b/sourcecode/scoring/post_selection_similarity.py new file mode 100644 index 00000000..45c1eaf2 --- /dev/null +++ b/sourcecode/scoring/post_selection_similarity.py @@ -0,0 +1,297 @@ +import gc +import logging +import sys +from typing import Dict + +from . 
import constants as c + +import numpy as np +import pandas as pd + + +logger = logging.getLogger("birdwatch.post_selection_similarity") +logger.setLevel(logging.INFO) + + +class PostSelectionSimilarity: + def __init__( + self, + notes: pd.DataFrame, + ratings: pd.DataFrame, + pmiRegularization: int = 500, + smoothedNpmiThreshold: float = 0.55, + minimumRatingProportionThreshold: float = 0.4, + minUniquePosts: int = 10, + minSimPseudocounts: int = 10, + windowMillis: int = 1000 * 60 * 20, + ): + self.ratings = _preprocess_ratings(notes, ratings) + with c.time_block("Compute pair counts dict"): + self.pairCountsDict = _get_pair_counts_dict(self.ratings, windowMillis=windowMillis) + + self.uniqueRatingsOnTweets = self.ratings[ + [c.tweetIdKey, c.raterParticipantIdKey] + ].drop_duplicates() + raterTotals = self.uniqueRatingsOnTweets[c.raterParticipantIdKey].value_counts() + raterTotalsDict = { + index: value for index, value in raterTotals.items() if value >= minUniquePosts + } + + self.pairCountsDict = _join_rater_totals_compute_pmi_and_filter_edges_below_threshold( + pairCountsDict=self.pairCountsDict, + raterTotalsDict=raterTotalsDict, + N=len(self.uniqueRatingsOnTweets), + pmiPseudocounts=pmiRegularization, + minSimPseudocounts=minSimPseudocounts, + smoothedNpmiThreshold=smoothedNpmiThreshold, + minimumRatingProportionThreshold=minimumRatingProportionThreshold, + ) + + def get_high_post_selection_similarity_raters(self): + uniqueRaters = set() + for r1, r2 in self.pairCountsDict.keys(): + uniqueRaters.add(r1) + uniqueRaters.add(r2) + highPostSelectionSimilarityRaters = pd.DataFrame( + list(uniqueRaters), columns=[c.raterParticipantIdKey] + ) + highPostSelectionSimilarityRaters[c.postSelectionValueKey] = 1 + return highPostSelectionSimilarityRaters + + def get_post_selection_similarity_values(self): + """ + Returns dataframe with [raterParticipantId, postSelectionSimilarityValue] columns. + postSelectionSimilarityValue is None by default. + """ + cliqueToUserMap, userToCliqueMap = aggregate_into_cliques(self.pairCountsDict) + + # Convert dict to pandas dataframe + cliquesDfList = [] + for cliqueId in cliqueToUserMap.keys(): + for userId in cliqueToUserMap[cliqueId]: + cliquesDfList.append({c.raterParticipantIdKey: userId, c.postSelectionValueKey: cliqueId}) + cliquesDf = pd.DataFrame( + cliquesDfList, columns=[c.raterParticipantIdKey, c.postSelectionValueKey] + ) + return cliquesDf + + +def filter_ratings_by_post_selection_similarity(notes, ratings, postSelectionSimilarityValues): + """ + Filters out ratings after the first on each note from raters who have high post selection similarity, + or filters all if the note is authored by a user with the same post selection similarity value. 
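    For example (hypothetical values): if raters A and B share postSelectionValue 7
    and both rate note N, only the earliest of their two ratings on N is kept; if the
    author of N also has postSelectionValue 7, both ratings on N are dropped.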
+ """ + ratingsWithPostSelectionSimilarity = ( + ratings.merge( + postSelectionSimilarityValues, + on=c.raterParticipantIdKey, + how="left", + unsafeAllowed=c.postSelectionValueKey, + ) + .merge(notes[[c.noteIdKey, c.noteAuthorParticipantIdKey]], on=c.noteIdKey, how="left") + .merge( + postSelectionSimilarityValues, + left_on=c.noteAuthorParticipantIdKey, + right_on=c.raterParticipantIdKey, + how="left", + suffixes=("", "_note_author"), + unsafeAllowed={c.postSelectionValueKey, c.postSelectionValueKey + "_note_author"}, + ) + ) + ratingsWithNoPostSelectionSimilarityValue = ratingsWithPostSelectionSimilarity[ + pd.isna(ratingsWithPostSelectionSimilarity[c.postSelectionValueKey]) + ] + ratingsWithPostSelectionSimilarityValue = ratingsWithPostSelectionSimilarity[ + (~pd.isna(ratingsWithPostSelectionSimilarity[c.postSelectionValueKey])) + & ( + ratingsWithPostSelectionSimilarity[c.postSelectionValueKey] + != ratingsWithPostSelectionSimilarity[c.postSelectionValueKey + "_note_author"] + ) + ] + ratingsWithPostSelectionSimilarityValue.sort_values( + by=[c.noteIdKey, c.createdAtMillisKey], ascending=True, inplace=True + ) + ratingsWithPostSelectionSimilarityValue.drop_duplicates( + subset=[c.noteIdKey, c.postSelectionValueKey], keep="first", inplace=True + ) + + if len(notes) < c.minNumNotesForProdData: + return ratings + + ratings = pd.concat( + [ratingsWithPostSelectionSimilarityValue, ratingsWithNoPostSelectionSimilarityValue], axis=0 + ) + ratings.drop( + columns={c.noteAuthorParticipantIdKey, c.raterParticipantIdKey + "_note_author"}, + errors="ignore", + inplace=True, + ) + return ratings + + +def filter_all_ratings_by_post_selection_similarity(ratings, highPostSelectionSimilarityRaters): + """ + Deprecated. + Filters out all ratings from raters who have high post selection similarity. + """ + ratings = ratings.merge( + highPostSelectionSimilarityRaters, on=c.raterParticipantIdKey, how="left", indicator=True + ) + ratings = ratings[ratings["_merge"] == "left_only"] + ratings = ratings.drop(columns=["_merge"]) + return ratings + + +def _preprocess_ratings(notes: pd.DataFrame, ratings: pd.DataFrame) -> pd.DataFrame: + """ + Preprocess ratings dataframe. + """ + ratings = notes[[c.noteIdKey, c.tweetIdKey]].merge( + ratings[[c.raterParticipantIdKey, c.noteIdKey, c.createdAtMillisKey]], + on=c.noteIdKey, + how="inner", + ) + ratings = ratings[(ratings[c.tweetIdKey] != -1) & (ratings[c.tweetIdKey] != "-1")] + return ratings + + +def _join_rater_totals_compute_pmi_and_filter_edges_below_threshold( + pairCountsDict: Dict, + raterTotalsDict: Dict, + N: int, + pmiPseudocounts: int, + minSimPseudocounts: int, + smoothedNpmiThreshold: float, + minimumRatingProportionThreshold: float, +): + keys_to_delete = [] + + with c.time_block("Compute PMI and minSim"): + for leftRaterId, rightRaterId in pairCountsDict: + if leftRaterId not in raterTotalsDict or rightRaterId not in raterTotalsDict: + keys_to_delete.append((leftRaterId, rightRaterId)) + continue + + leftTotal = raterTotalsDict[leftRaterId] + rightTotal = raterTotalsDict[rightRaterId] + coRatings = pairCountsDict[(leftRaterId, rightRaterId)] + + if type(coRatings) != int: + # already processed (should only occur when re-running...) 
+ continue + + # PMI + pmiNumerator = coRatings * N + pmiDenominator = (leftTotal + pmiPseudocounts) * (rightTotal + pmiPseudocounts) + smoothedPmi = np.log(pmiNumerator / pmiDenominator) + smoothedNpmi = smoothedPmi / -np.log(coRatings / N) + + # minSim + minTotal = min(leftTotal, rightTotal) + minSimRatingProp = coRatings / (minTotal + minSimPseudocounts) + + if (smoothedNpmi >= smoothedNpmiThreshold) or ( + minSimRatingProp >= minimumRatingProportionThreshold + ): + pairCountsDict[(leftRaterId, rightRaterId)] = (smoothedNpmi, minSimRatingProp) + else: + keys_to_delete.append((leftRaterId, rightRaterId)) + + print(f"Pairs dict used {sys.getsizeof(pairCountsDict) * 1e-9}GB RAM at max") + + with c.time_block("Delete unneeded pairs from pairCountsDict"): + for key in keys_to_delete: + del pairCountsDict[key] + + print( + f"Pairs dict used {sys.getsizeof(pairCountsDict) * 1e-9}GB RAM after deleted unneeded pairs" + ) + + return pairCountsDict + + +def aggregate_into_cliques(pairCountsDict): + with c.time_block("Aggregate into cliques by post selection similarity"): + userToCliqueMap = dict() + cliqueToUserMap = dict() + + nextNewCliqueId = 1 # start cliqueIdxs from 1 + + for sid, tid in pairCountsDict.keys(): + if sid in userToCliqueMap: + if tid in userToCliqueMap: + # both in map. merge if not same clique + if userToCliqueMap[sid] != userToCliqueMap[tid]: + # merge. assign all member's of target clique to source clique. + # slow way: iterate over all values here. + # fast way: maintain a reverse map of cliqueToUserMap. + sourceDestClique = userToCliqueMap[sid] + oldTargetCliqueToDel = userToCliqueMap[tid] + + for userId in cliqueToUserMap[oldTargetCliqueToDel]: + cliqueToUserMap[sourceDestClique].append(userId) + userToCliqueMap[userId] = sourceDestClique + del cliqueToUserMap[oldTargetCliqueToDel] + gc.collect() + + else: + # source in map; target not. add target to source's clique + sourceClique = userToCliqueMap[sid] + userToCliqueMap[tid] = sourceClique + cliqueToUserMap[sourceClique].append(tid) + elif tid in userToCliqueMap: + # target in map; source not. 
add source to target's clique + targetClique = userToCliqueMap[tid] + userToCliqueMap[sid] = targetClique + cliqueToUserMap[targetClique].append(sid) + else: + # new clique + userToCliqueMap[sid] = nextNewCliqueId + userToCliqueMap[tid] = nextNewCliqueId + cliqueToUserMap[nextNewCliqueId] = [sid, tid] + nextNewCliqueId += 1 + return cliqueToUserMap, userToCliqueMap + + +def _get_pair_counts_dict(ratings, windowMillis): + pair_counts = dict() + + # Group by tweetIdKey to process each tweet individually + grouped_by_tweet = ratings.groupby(c.tweetIdKey, sort=False) + + for _, tweet_group in grouped_by_tweet: + # Keep track of pairs we've already counted for this tweetId + pairs_counted_in_tweet = set() + + # Group by noteIdKey within the tweet + grouped_by_note = tweet_group.groupby(c.noteIdKey, sort=False) + + for _, note_group in grouped_by_note: + note_group.sort_values(c.createdAtMillisKey, inplace=True) + + # Extract relevant columns as numpy arrays for efficient computation + times = note_group[c.createdAtMillisKey].values + raters = note_group[c.raterParticipantIdKey].values + + n = len(note_group) + window_start = 0 + + for i in range(n): + # Move the window start forward if the time difference exceeds windowMillis + while times[i] - times[window_start] > windowMillis: + window_start += 1 + + # For all indices within the sliding window (excluding the current index) + for j in range(window_start, i): + if raters[i] != raters[j]: + left_rater, right_rater = tuple(sorted((raters[i], raters[j]))) + pair = (left_rater, right_rater) + # Only count this pair once per tweetId + if pair not in pairs_counted_in_tweet: + pairs_counted_in_tweet.add(pair) + # Update the count for this pair + if pair not in pair_counts: + pair_counts[pair] = 0 + pair_counts[pair] += 1 + + return pair_counts diff --git a/sourcecode/scoring/post_selection_similarity_old.py b/sourcecode/scoring/post_selection_similarity_old.py new file mode 100644 index 00000000..f9704f0f --- /dev/null +++ b/sourcecode/scoring/post_selection_similarity_old.py @@ -0,0 +1,536 @@ +import gc +import logging +from typing import Dict + +from . 
import constants as c + +import numpy as np +import pandas as pd + + +logger = logging.getLogger("birdwatch.post_selection_similarity") +logger.setLevel(logging.INFO) + + +class PostSelectionSimilarity: + def __init__(self): + pass + + def initialize( + self, + notes: pd.DataFrame, + ratings: pd.DataFrame, + pmiRegularization: int = 500, + smoothedNpmiThreshold: float = 0.45, + minimumRatingProportionThreshold: float = 0.4, + minUniquePosts: int = 10, + minSimPseudocounts: int = 10, + windowMillis: int = 1000 * 60 * 20, + ): + self.ratings = _preprocess_ratings(notes, ratings) + self.pairCounts = _get_pair_tuples(self.ratings, windowMillis=windowMillis) + self.pairStatsDf = _tuples_to_df(self.pairCounts) + self.uniqueRatingsOnTweets = self.ratings[ + [c.tweetIdKey, c.raterParticipantIdKey] + ].drop_duplicates() + self.pairStatsDf = _join_rater_totals(self.pairStatsDf, self.uniqueRatingsOnTweets) + self.pmiDf = _compute_pmi( + self.pairStatsDf, len(self.uniqueRatingsOnTweets), pmiRegularization, minSimPseudocounts + ) + + self.filter_edges_below_threshold( + smoothedNpmiThreshold, minimumRatingProportionThreshold, minUniquePosts + ) + + def filter_edges_below_threshold( + self, smoothedNpmiThreshold, minimumRatingProportionThreshold, minUniquePosts + ): + self.graphDf = self.pmiDf[ + (self.pmiDf["smoothedNpmi"] >= smoothedNpmiThreshold) + | ( + (self.pmiDf["minSimRatingProp"] >= minimumRatingProportionThreshold) + & (self.pmiDf["minTotal"] >= minUniquePosts) + ) + ] + + def get_high_post_selection_similarity_raters(self): + highPostSelectionSimilarityRaters = pd.concat( + [ + self.graphDf[["leftRaterId"]].rename(columns={"leftRaterId": c.raterParticipantIdKey}), + self.graphDf[["rightRaterId"]].rename(columns={"rightRaterId": c.raterParticipantIdKey}), + ] + ).drop_duplicates() + highPostSelectionSimilarityRaters[c.postSelectionValueKey] = 1 + return highPostSelectionSimilarityRaters + + def get_post_selection_similarity_values(self): + """ + Returns dataframe with [raterParticipantId, postSelectionSimilarityValue] columns. + postSelectionSimilarityValue is None by default. + """ + cliqueToUserMap, userToCliqueMap = aggregate_into_cliques(self.graphDf) + + # Convert dict to pandas dataframe + cliquesDfList = [] + for cliqueId in cliqueToUserMap.keys(): + for userId in cliqueToUserMap[cliqueId]: + cliquesDfList.append({c.raterParticipantIdKey: userId, c.postSelectionValueKey: cliqueId}) + cliquesDf = pd.DataFrame( + cliquesDfList, columns=[c.raterParticipantIdKey, c.postSelectionValueKey] + ) + return cliquesDf + + +def filter_ratings_by_post_selection_similarity(notes, ratings, postSelectionSimilarityValues): + """ + Filters out ratings after the first on each note from raters who have high post selection similarity, + or filters all if the note is authored by a user with the same post selection similarity value. 
+ """ + ratingsWithPostSelectionSimilarity = ( + ratings.merge( + postSelectionSimilarityValues, + on=c.raterParticipantIdKey, + how="left", + unsafeAllowed=c.postSelectionValueKey, + ) + .merge(notes[[c.noteIdKey, c.noteAuthorParticipantIdKey]], on=c.noteIdKey, how="left") + .merge( + postSelectionSimilarityValues, + left_on=c.noteAuthorParticipantIdKey, + right_on=c.raterParticipantIdKey, + how="left", + suffixes=("", "_note_author"), + unsafeAllowed={c.postSelectionValueKey, c.postSelectionValueKey + "_note_author"}, + ) + ) + ratingsWithNoPostSelectionSimilarityValue = ratingsWithPostSelectionSimilarity[ + pd.isna(ratingsWithPostSelectionSimilarity[c.postSelectionValueKey]) + ] + ratingsWithPostSelectionSimilarityValue = ratingsWithPostSelectionSimilarity[ + (~pd.isna(ratingsWithPostSelectionSimilarity[c.postSelectionValueKey])) + & ( + ratingsWithPostSelectionSimilarity[c.postSelectionValueKey] + != ratingsWithPostSelectionSimilarity[c.postSelectionValueKey + "_note_author"] + ) + ] + ratingsWithPostSelectionSimilarityValue.sort_values( + by=[c.noteIdKey, c.createdAtMillisKey], ascending=True, inplace=True + ) + ratingsWithPostSelectionSimilarityValue.drop_duplicates( + subset=[c.noteIdKey, c.postSelectionValueKey], keep="first", inplace=True + ) + + ratings = pd.concat( + [ratingsWithPostSelectionSimilarityValue, ratingsWithNoPostSelectionSimilarityValue], axis=0 + ) + ratings.drop( + columns={c.noteAuthorParticipantIdKey, c.raterParticipantIdKey + "_note_author"}, + errors="ignore", + inplace=True, + ) + return ratings + + +def filter_all_ratings_by_post_selection_similarity(ratings, highPostSelectionSimilarityRaters): + """ + Deprecated. + Filters out all ratings from raters who have high post selection similarity. + """ + ratings = ratings.merge( + highPostSelectionSimilarityRaters, on=c.raterParticipantIdKey, how="left", indicator=True + ) + ratings = ratings[ratings["_merge"] == "left_only"] + ratings = ratings.drop(columns=["_merge"]) + return ratings + + +def _compute_pmi( + pairStatsDf: pd.DataFrame, N: int, pmiPseudocounts: int = 500, minSimPseudocounts: int = 10 +) -> pd.DataFrame: + """ + Compute PMI between raters. + """ + numerator = pairStatsDf["pairRatings"] * N + denominator = (pairStatsDf["leftTotal"] + pmiPseudocounts) * ( + pairStatsDf["rightTotal"] + pmiPseudocounts + ) + pairStatsDf["smoothedPmi"] = np.log(numerator / denominator) + pairStatsDf["smoothedNpmi"] = pairStatsDf["smoothedPmi"] / -np.log(pairStatsDf["pairRatings"] / N) + pairStatsDf["minTotal"] = np.minimum(pairStatsDf["leftTotal"], pairStatsDf["rightTotal"]) + pairStatsDf["minSimRatingProp"] = pairStatsDf["pairRatings"] / ( + pairStatsDf["minTotal"] + minSimPseudocounts + ) + return pairStatsDf + + +def _preprocess_ratings(notes: pd.DataFrame, ratings: pd.DataFrame) -> pd.DataFrame: + """ + Preprocess ratings dataframe. 
+ """ + ratings = notes[[c.noteIdKey, c.tweetIdKey]].merge( + ratings[[c.raterParticipantIdKey, c.noteIdKey, c.createdAtMillisKey]], + on=c.noteIdKey, + how="inner", + ) + ratings = ratings[(ratings[c.tweetIdKey] != -1) & (ratings[c.tweetIdKey] != "-1")] + return ratings + + +def _join_rater_totals( + pairStatsDf: pd.DataFrame, uniqueRatingsOnTweets: pd.DataFrame, minRatings: int = 10 +): + raterTotals = uniqueRatingsOnTweets[c.raterParticipantIdKey].value_counts().reset_index() + raterTotals.columns = [c.raterParticipantIdKey, "count"] + raterTotals = raterTotals[raterTotals["count"] >= minRatings] + pairStatsDf = pairStatsDf.merge( + raterTotals.rename(columns={c.raterParticipantIdKey: "leftRaterId", "count": "leftTotal"}) + ) + pairStatsDf = pairStatsDf.merge( + raterTotals.rename(columns={c.raterParticipantIdKey: "rightRaterId", "count": "rightTotal"}) + ) + return pairStatsDf + + +def aggregate_into_cliques(graphDf): + with c.time_block("Aggregate into cliques by post selection similarity"): + userToCliqueMap = dict() + cliqueToUserMap = dict() + + nextNewCliqueId = 1 # start cliqueIdxs from 1 + for i, row in graphDf.iterrows(): + sid = row["leftRaterId"] + tid = row["rightRaterId"] + if sid in userToCliqueMap: + if tid in userToCliqueMap: + # both in map. merge if not same clique + if userToCliqueMap[sid] != userToCliqueMap[tid]: + # merge. assign all member's of target clique to source clique. + # slow way: iterate over all values here. + # fast way: maintain a reverse map of cliqueToUserMap. + sourceDestClique = userToCliqueMap[sid] + oldTargetCliqueToDel = userToCliqueMap[tid] + + for userId in cliqueToUserMap[oldTargetCliqueToDel]: + cliqueToUserMap[sourceDestClique].append(userId) + userToCliqueMap[userId] = sourceDestClique + del cliqueToUserMap[oldTargetCliqueToDel] + gc.collect() + + else: + # source in map; target not. add target to source's clique + sourceClique = userToCliqueMap[sid] + userToCliqueMap[tid] = sourceClique + cliqueToUserMap[sourceClique].append(tid) + elif tid in userToCliqueMap: + # target in map; source not. 
add source to target's clique + targetClique = userToCliqueMap[tid] + userToCliqueMap[sid] = targetClique + cliqueToUserMap[targetClique].append(sid) + else: + # new clique + userToCliqueMap[sid] = nextNewCliqueId + userToCliqueMap[tid] = nextNewCliqueId + cliqueToUserMap[nextNewCliqueId] = [sid, tid] + nextNewCliqueId += 1 + return cliqueToUserMap, userToCliqueMap + + +def _make_rater_stats_df(pairCounts): + with c.time_block("Making rater stats dataframe from pair counts dict"): + leftRater, rightRater, pairRatings = [], [], [] + for i, ((left, right), count) in enumerate(pairCounts.items()): + leftRater.append(left) + rightRater.append(right) + pairRatings.append(count) + return pd.DataFrame( + { + "leftRaterId": np.array(leftRater), + "rightRaterId": np.array(rightRater), + "pairRatings": np.array(pairRatings), + } + ) + +def _get_pair_counts_df_dict(ratings, windowMillis): + import numpy as np + import pandas as pd + from collections import defaultdict + + # Assign column keys to local variables for faster access + noteIdKey = c.noteIdKey + createdAtMillisKey = c.createdAtMillisKey + raterParticipantIdKey = c.raterParticipantIdKey + + # Sort ratings by noteIdKey and createdAtMillisKey + ratings_sorted = ratings.sort_values([noteIdKey, createdAtMillisKey]) + + # Initialize a defaultdict to store counts of pairs + pair_counts = defaultdict(int) + + # Group by noteIdKey to process each note individually + grouped = ratings_sorted.groupby(noteIdKey, sort=False) + + for noteId, group in grouped: + # Extract relevant columns as numpy arrays for efficient computation + times = group[createdAtMillisKey].values + raters = group[raterParticipantIdKey].values + + n = len(group) + window_start = 0 + + for i in range(n): + # Move the window start forward if the time difference exceeds windowMillis + while times[i] - times[window_start] > windowMillis: + window_start += 1 + + # For all indices within the sliding window (excluding the current index) + for j in range(window_start, i): + if raters[i] != raters[j]: + left_rater, right_rater = tuple(sorted((raters[i], raters[j]))) + # Update the count for this pair + pair_counts[(left_rater, right_rater)] += 1 + + # Convert the pair_counts dictionary to a DataFrame + if pair_counts: + pairs = np.array(list(pair_counts.keys())) + counts = np.array(list(pair_counts.values())) + df = pd.DataFrame({ + 'leftRaterId': pairs[:, 0], + 'rightRaterId': pairs[:, 1], + 'pairRatings': counts + }) + else: + # Return an empty DataFrame with appropriate columns + df = pd.DataFrame(columns=['leftRaterId', 'rightRaterId', 'pairRatings']) + + return df + + +def _get_pair_ratings_df_optimized(ratings, windowMillis): + + # Assign column keys to local variables for faster access + noteIdKey = c.noteIdKey + createdAtMillisKey = c.createdAtMillisKey + raterParticipantIdKey = c.raterParticipantIdKey + tweetIdKey = c.tweetIdKey + + # Sort ratings by noteIdKey and createdAtMillisKey + ratings_sorted = ratings.sort_values([noteIdKey, createdAtMillisKey]) + + # Initialize lists to collect data + left_raters = [] + right_raters = [] + tweet_ids = [] + + # Group by noteIdKey to process each note individually + grouped = ratings_sorted.groupby(noteIdKey, sort=False) + + for noteId, group in grouped: + # Extract relevant columns as numpy arrays for efficient computation + times = group[createdAtMillisKey].values + raters = group[raterParticipantIdKey].values + tweetId = group[tweetIdKey].iloc[0] # Assuming tweetIdKey is constant within a note + + n = len(group) + window_start = 0 + + 
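    # Sliding-window sketch (illustrative timestamps): with windowMillis equal to
    # 20 minutes and ratings on the same note at t = 0, 5, and 30 minutes, only the
    # (t=0, t=5) pair is counted below; the t=30 rating falls outside the window of
    # both earlier ratings.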
for i in range(n): + # Move the window start forward if the time difference exceeds windowMillis + while times[i] - times[window_start] > windowMillis: + window_start += 1 + + # For all indices within the sliding window (excluding the current index) + for j in range(window_start, i): + if raters[i] != raters[j]: + left_rater, right_rater = tuple(sorted((raters[i], raters[j]))) + left_raters.append(left_rater) + right_raters.append(right_rater) + tweet_ids.append(tweetId) + + # Convert lists to numpy arrays for efficient DataFrame creation + left_raters = np.array(left_raters) + right_raters = np.array(right_raters) + tweet_ids = np.array(tweet_ids) + + # Create the DataFrame from the collected data + df = pd.DataFrame({ + 'leftRaterId': left_raters, + 'rightRaterId': right_raters, + 'tweetId': tweet_ids, + }) + + # Drop duplicates + df = df.drop_duplicates() + + # Group by leftRaterId and rightRaterId and count the number of occurrences + df = ( + df.groupby(['leftRaterId', 'rightRaterId'], as_index=False) + .agg(pairRatings=('tweetId', 'count')) + ) + return df + + +# get number of ratings per pair in same time window +def _get_pair_tuples(ratings, windowMillis): + tuples = [] + ratings = ratings.sort_values([c.noteIdKey, c.createdAtMillisKey]) + values = ratings[ + [c.noteIdKey, c.createdAtMillisKey, c.raterParticipantIdKey, c.tweetIdKey] + ].values + print(len(values)) + for i in range(len(values)): + priorNote, priorTs, priorRater, priorTweet = values[i] + if i == 0 or i == 1000 or i == 100000 or i % 5000000 == 0: + print(f"i={i} len(tuples)={len(tuples)}") + j = i + 1 + while j < len(values): + nextNote, nextTs, nextRater, nextTweet = values[j] + assert priorNote <= nextNote, (priorNote, nextNote) + if nextNote != priorNote: + break # break if we're onto a new note + assert priorTweet == nextTweet, (priorTweet, nextTweet) # tweet should be same + assert priorRater != nextRater, (priorRater, nextRater) # rater should be different + assert priorTs <= nextTs, (priorTs, nextTs) + if nextTs > (priorTs + windowMillis): + break # break if we're beyond the overlap window + leftRater, rigthRater = tuple(sorted((priorRater, nextRater))) + tuples.append((leftRater, rigthRater, priorTweet)) + j += 1 + return tuples + +def _get_pair_tuples_optimized(ratings, windowMillis): + + # Sort ratings by noteIdKey and createdAtMillisKey + ratings_sorted = ratings.sort_values([c.noteIdKey, c.createdAtMillisKey]) + + # Initialize an empty list to store the result + tuples = [] + + # Group by noteIdKey to process each note individually + grouped = ratings_sorted.groupby(c.noteIdKey, sort=False) + + for noteId, group in grouped: + # Extract relevant columns as numpy arrays for efficient computation + times = group[c.createdAtMillisKey].values + raters = group[c.raterParticipantIdKey].values + priorTweet = group[c.tweetIdKey].iloc[0] + + n = len(group) + window_start = 0 # Start index of the sliding window + + for i in range(n): + # Move the window start forward if the time difference exceeds windowMillis + while times[i] - times[window_start] > windowMillis: + window_start += 1 + + # For all indices within the sliding window (excluding the current index) + for j in range(window_start, i): + # Check if raters are different + if raters[i] != raters[j]: + # Sort raters to maintain consistency + leftRater, rightRater = tuple(sorted((raters[i], raters[j]))) + tuples.append((leftRater, rightRater, priorTweet)) + + return tuples + + +import multiprocessing as mp + +def _get_pair_tuples_parallel(ratings, windowMillis): 
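+ # Parallel variant: each note's ratings group is handled by a separate worker via multiprocessing.Pool, applying the same sliding-window pairing as _get_pair_tuples_optimized, and the per-group tuple lists are flattened into one result list.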
+ # Sort and group ratings + ratings_sorted = ratings.sort_values([c.noteIdKey, c.createdAtMillisKey]) + grouped = ratings_sorted.groupby(c.noteIdKey, sort=False) + + # Prepare arguments for parallel processing + args = [(group, windowMillis) for _, group in grouped] + + # Use multiprocessing Pool + with mp.Pool(mp.cpu_count()) as pool: + results = pool.starmap(_get_pair_tuples_process_group, args) + + # Flatten the list of results + tuples = [tup for sublist in results for tup in sublist] + return tuples + +def _get_pair_tuples_process_group(group, windowMillis): + # Same logic as before, applied to a single group + times = group[c.createdAtMillisKey].values + raters = group[c.raterParticipantIdKey].values + priorTweet = group[c.tweetIdKey].iloc[0] + + n = len(group) + window_start = 0 + tuples = [] + + for i in range(n): + while times[i] - times[window_start] > windowMillis: + window_start += 1 + for j in range(window_start, i): + if raters[i] != raters[j]: + leftRater, rightRater = tuple(sorted((raters[i], raters[j]))) + tuples.append((leftRater, rightRater, priorTweet)) + return tuples + + + +def _tuples_to_df(tuples, name="pairRatings"): + leftRater, rightRater, tweetId = zip(*tuples) + df = pd.DataFrame( + { + "leftRaterId": np.array(leftRater), + "rightRaterId": np.array(rightRater), + "tweetId": np.array(tweetId), + } + ) + print(len(df)) + df = df.drop_duplicates() + print(len(df)) + df = ( + df.groupby(["leftRaterId", "rightRaterId"]) + .count() + .reset_index(drop=False) + .rename(columns={"tweetId": name}) + ) + print(len(df)) + return df + + +def _get_pair_counts(ratings: pd.DataFrame, windowMillis: int = 1000 * 60 * 30) -> Dict: + """ + Compute counts of unique posts that were co-rated within windowMillis millis of each other + by different users. + + Returns dict: (raterId1, raterId2) => count. + """ + with c.time_block("Computing rating pair counts"): + counts = dict() + seen = set() + ratings = ratings.sort_values([c.noteIdKey, c.createdAtMillisKey]) + values = ratings[ + [c.noteIdKey, c.createdAtMillisKey, c.raterParticipantIdKey, c.tweetIdKey] + ].values + logger.info(len(values)) + for i in range(len(values)): + priorNote, priorTs, priorRater, priorTweet = values[i] + if i == 0 or i == 1000 or i == 100000 or i % 5000000 == 0: + logger.info(f"get_pair_counts i={i}") + j = i + 1 + while j < len(values): + nextNote, nextTs, nextRater, nextTweet = values[j] + assert priorNote <= nextNote, (priorNote, nextNote) + if nextNote != priorNote: + break # break if we're onto a new note + assert priorTweet == nextTweet, (priorTweet, nextTweet) # tweet should be same + assert priorRater != nextRater, (priorRater, nextRater) # rater should be different + assert priorTs <= nextTs, (priorTs, nextTs) + if nextTs > (priorTs + windowMillis): + break # break if we're beyond windowMillis + raterPairKey = tuple(sorted((priorRater, nextRater))) + raterTweetPairKey = (raterPairKey, priorTweet) + if raterTweetPairKey in seen: + break # break if we already counted a match on this tweet + seen.add(raterTweetPairKey) + if raterPairKey not in counts: + counts[raterPairKey] = 0 + counts[raterPairKey] += 1 + j += 1 + return counts diff --git a/sourcecode/scoring/process_data.py b/sourcecode/scoring/process_data.py index d7876846..7a560dd9 100644 --- a/sourcecode/scoring/process_data.py +++ b/sourcecode/scoring/process_data.py @@ -1,12 +1,20 @@ from abc import ABC, abstractmethod from io import StringIO +import logging import os from typing import Dict, List, Optional, Tuple from . 
import constants as c, note_status_history +from .pandas_utils import get_df_info +import joblib import numpy as np import pandas as pd +from sklearn.pipeline import Pipeline + + +logger = logging.getLogger("birdwatch.process_data") +logger.setLevel(logging.INFO) def read_from_strings( @@ -39,7 +47,13 @@ def read_from_strings( def tsv_parser( - rawTSV: str, mapping: Dict[str, type], columns: List[str], header: bool + rawTSV: str, + mapping: Dict[str, type], + columns: List[str], + header: bool, + useCols: Optional[List[str]] = None, + chunkSize: Optional[int] = None, + convertNAToNone: bool = True, ) -> pd.DataFrame: """Parse a TSV input and raise an Exception if the input is not formatted as expected. @@ -48,6 +62,8 @@ def tsv_parser( mapping: Dict mapping column names to types columns: List of column names header: bool indicating whether the input will have a header + useCols: Optional list of columns to return + chunkSize: Optional number of rows to read at a time when returning a subset of columns Returns: pd.DataFrame containing parsed data @@ -58,36 +74,78 @@ def tsv_parser( if num_fields != len(columns): raise ValueError(f"Expected {len(columns)} columns, but got {num_fields}") - data = pd.read_csv( - StringIO(rawTSV), - sep="\t", - names=columns, - dtype=mapping, - header=0 if header else None, - index_col=[], - ) + if useCols and chunkSize: + textParser = pd.read_csv( + StringIO(rawTSV), + sep="\t", + names=columns, + dtype=mapping, + header=0 if header else None, + index_col=[], + usecols=useCols, + chunksize=chunkSize, + ) + data = pd.concat(textParser, ignore_index=True) + else: + data = pd.read_csv( + StringIO(rawTSV), + sep="\t", + names=columns, + dtype=mapping, + header=0 if header else None, + index_col=[], + usecols=useCols, + ) + if convertNAToNone: + logger.info("Logging size effect of convertNAToNone") + logger.info("Before conversion:") + logger.info(get_df_info(data)) + # float types will be nan if missing; newer nullable types like "StringDtype" or "Int64Dtype" will by default + # be pandas._libs.missing.NAType if missing. Set those to None and change the dtype back to object. 
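+ # For example, a missing value in an Int64 column parsed as pd.NA is replaced with None below, and the column dtype is changed to object.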
+ for colname, coltype in mapping.items(): + # check if coltype is pd.BooleanDtype + if coltype in set( + [pd.StringDtype(), pd.BooleanDtype(), pd.Int64Dtype(), pd.Int32Dtype(), "boolean"] + ): + data[colname] = data[colname].astype(object) + data.loc[pd.isna(data[colname]), colname] = None + logger.info("After conversion:") + logger.info(get_df_info(data)) return data except (ValueError, IndexError) as e: raise ValueError(f"Invalid input: {e}") -def tsv_reader_single(path: str, mapping, columns, header=False, parser=tsv_parser): +def tsv_reader_single( + path: str, mapping, columns, header=False, parser=tsv_parser, convertNAToNone=True +): """Read a single TSV file.""" with open(path, "r", encoding="utf-8") as handle: - return tsv_parser(handle.read(), mapping, columns, header) + return tsv_parser(handle.read(), mapping, columns, header, convertNAToNone=convertNAToNone) -def tsv_reader(path: str, mapping, columns, header=False, parser=tsv_parser): +def tsv_reader( + path: str, mapping, columns, header=False, parser=tsv_parser, convertNAToNone=True +) -> pd.DataFrame: """Read a single TSV file or a directory of TSV files.""" if os.path.isdir(path): dfs = [ - tsv_reader_single(os.path.join(path, filename), mapping, columns, header, parser) + tsv_reader_single( + os.path.join(path, filename), + mapping, + columns, + header, + parser, + convertNAToNone=convertNAToNone, + ) for filename in os.listdir(path) if filename.endswith(".tsv") ] return pd.concat(dfs, ignore_index=True) else: - return tsv_reader_single(path, mapping, columns, header, parser) + return tsv_reader_single( + path, mapping, columns, header, parser, convertNAToNone=convertNAToNone + ) def read_from_tsv( @@ -111,7 +169,9 @@ def read_from_tsv( if notesPath is None: notes = None else: - notes = tsv_reader(notesPath, c.noteTSVTypeMapping, c.noteTSVColumns, header=headers) + notes = tsv_reader( + notesPath, c.noteTSVTypeMapping, c.noteTSVColumns, header=headers, convertNAToNone=False + ) assert len(notes.columns) == len(c.noteTSVColumns) and all(notes.columns == c.noteTSVColumns), ( f"note columns don't match: \n{[col for col in notes.columns if not col in c.noteTSVColumns]} are extra columns, " + f"\n{[col for col in c.noteTSVColumns if not col in notes.columns]} are missing." @@ -120,7 +180,9 @@ def read_from_tsv( if ratingsPath is None: ratings = None else: - ratings = tsv_reader(ratingsPath, c.ratingTSVTypeMapping, c.ratingTSVColumns, header=headers) + ratings = tsv_reader( + ratingsPath, c.ratingTSVTypeMapping, c.ratingTSVColumns, header=headers, convertNAToNone=False + ) assert len(ratings.columns.values) == len(c.ratingTSVColumns) and all( ratings.columns == c.ratingTSVColumns ), ( @@ -131,18 +193,36 @@ def read_from_tsv( if noteStatusHistoryPath is None: noteStatusHistory = None else: - noteStatusHistory = tsv_reader( - noteStatusHistoryPath, - c.noteStatusHistoryTSVTypeMapping, - c.noteStatusHistoryTSVColumns, - header=headers, - ) - assert len(noteStatusHistory.columns.values) == len(c.noteStatusHistoryTSVColumns) and all( - noteStatusHistory.columns == c.noteStatusHistoryTSVColumns - ), ( - f"noteStatusHistory columns don't match: \n{[col for col in noteStatusHistory.columns if not col in c.noteStatusHistoryTSVColumns]} are extra columns, " - + f"\n{[col for col in c.noteStatusHistoryTSVColumns if not col in noteStatusHistory.columns]} are missing." - ) + # TODO(jiansongc): clean up after new column is in production. 
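+ # Try the new note status history schema first; if the column count does not match, fall back to the old schema and backfill timestampMillisOfFirstNmrDueToMinStableCrhTime with NaN.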
+ try: + noteStatusHistory = tsv_reader( + noteStatusHistoryPath, + c.noteStatusHistoryTSVTypeMapping, + c.noteStatusHistoryTSVColumns, + header=headers, + convertNAToNone=False, + ) + assert len(noteStatusHistory.columns.values) == len(c.noteStatusHistoryTSVColumns) and all( + noteStatusHistory.columns == c.noteStatusHistoryTSVColumns + ), ( + f"noteStatusHistory columns don't match: \n{[col for col in noteStatusHistory.columns if not col in c.noteStatusHistoryTSVColumns]} are extra columns, " + + f"\n{[col for col in c.noteStatusHistoryTSVColumns if not col in noteStatusHistory.columns]} are missing." + ) + except ValueError: + noteStatusHistory = tsv_reader( + noteStatusHistoryPath, + c.noteStatusHistoryTSVTypeMappingOld, + c.noteStatusHistoryTSVColumnsOld, + header=headers, + convertNAToNone=False, + ) + noteStatusHistory[c.timestampMillisOfFirstNmrDueToMinStableCrhTimeKey] = np.nan + assert len(noteStatusHistory.columns.values) == len(c.noteStatusHistoryTSVColumns) and all( + noteStatusHistory.columns == c.noteStatusHistoryTSVColumns + ), ( + f"noteStatusHistory columns don't match: \n{[col for col in noteStatusHistory.columns if not col in c.noteStatusHistoryTSVColumns]} are extra columns, " + + f"\n{[col for col in c.noteStatusHistoryTSVColumns if not col in noteStatusHistory.columns]} are missing." + ) if userEnrollmentPath is None: userEnrollment = None @@ -152,6 +232,7 @@ def read_from_tsv( c.userEnrollmentTSVTypeMapping, c.userEnrollmentTSVColumns, header=headers, + convertNAToNone=False, ) assert len(userEnrollment.columns.values) == len(c.userEnrollmentTSVColumns) and all( userEnrollment.columns == c.userEnrollmentTSVColumns @@ -167,7 +248,7 @@ def _filter_misleading_notes( notes: pd.DataFrame, ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, - logging: bool = True, + log: bool = True, ) -> pd.DataFrame: """ This function actually filters ratings (not notes), based on which notes they rate. @@ -180,7 +261,7 @@ def _filter_misleading_notes( notes (pd.DataFrame): _description_ ratings (pd.DataFrame): _description_ noteStatusHistory (pd.DataFrame): _description_ - logging (bool, optional): _description_. Defaults to True. + log (bool, optional): _description_. Defaults to True. Returns: pd.DataFrame: filtered ratings @@ -190,6 +271,7 @@ def _filter_misleading_notes( on=c.noteIdKey, how="left", suffixes=("", "_nsh"), + unsafeAllowed=c.createdAtMillisKey, ) deletedNoteKey = "deletedNote" @@ -213,20 +295,20 @@ def _filter_misleading_notes( ratings[c.classificationKey] == c.noteSaysTweetIsNotMisleadingKey ) & (ratings[createdAtMillisNSHKey] > c.notMisleadingUILaunchTime) - if logging: - print( + if log: + logger.info( f"Preprocess Data: Filter misleading notes, starting with {len(ratings)} ratings on {len(np.unique(ratings[c.noteIdKey]))} notes" ) - print( + logger.info( f" Keeping {ratings[notDeletedMisleadingKey].sum()} ratings on {len(np.unique(ratings.loc[ratings[notDeletedMisleadingKey],c.noteIdKey]))} misleading notes" ) - print( + logger.info( f" Keeping {ratings[deletedButInNSHKey].sum()} ratings on {len(np.unique(ratings.loc[ratings[deletedButInNSHKey],c.noteIdKey]))} deleted notes that were previously scored (in note status history)" ) - print( + logger.info( f" Removing {notDeletedNotMisleadingOldUI.sum()} ratings on {len(np.unique(ratings.loc[notDeletedNotMisleadingOldUI, c.noteIdKey]))} older notes that aren't deleted, but are not-misleading." 
) - print( + logger.info( f" Removing {deletedNotInNSH.sum()} ratings on {len(np.unique(ratings.loc[deletedNotInNSH, c.noteIdKey]))} notes that were deleted and not in note status history (e.g. old)." ) @@ -285,13 +367,29 @@ def remove_duplicate_notes(notes: pd.DataFrame) -> pd.DataFrame: return notes +def compute_helpful_num(ratings: pd.DataFrame): + """ + Populate the "helpfulNum" column. + not helpful: 0.0 + somewhat helpful: 0.5 + helpful: 1.0 + """ + ratings.loc[:, c.helpfulNumKey] = np.nan + ratings.loc[ratings[c.helpfulKey] == 1, c.helpfulNumKey] = 1 + ratings.loc[ratings[c.notHelpfulKey] == 1, c.helpfulNumKey] = 0 + ratings.loc[ratings[c.helpfulnessLevelKey] == c.notHelpfulValueTsv, c.helpfulNumKey] = 0 + ratings.loc[ratings[c.helpfulnessLevelKey] == c.somewhatHelpfulValueTsv, c.helpfulNumKey] = 0.5 + ratings.loc[ratings[c.helpfulnessLevelKey] == c.helpfulValueTsv, c.helpfulNumKey] = 1 + ratings = ratings.loc[~pd.isna(ratings[c.helpfulNumKey])] + return ratings def preprocess_data( notes: pd.DataFrame, ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, shouldFilterNotMisleadingNotes: bool = True, - logging: bool = True, + log: bool = True, + ratingsOnly: bool = False, ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: """Populate helpfulNumKey, a unified column that merges the helpfulness answers from the V1 and V2 rating forms together, as described in @@ -304,42 +402,40 @@ def preprocess_data( ratings (pd.DataFrame) noteStatusHistory (pd.DataFrame) shouldFilterNotMisleadingNotes (bool, optional): Defaults to True. - logging (bool, optional): Defaults to True. + log (bool, optional): Defaults to True. + ratingsOnly (bool, optional): Defaults to False Returns: notes (pd.DataFrame) ratings (pd.DataFrame) noteStatusHistory (pd.DataFrame) """ - if logging: - print( - "Timestamp of latest rating in data: ", - pd.to_datetime(ratings[c.createdAtMillisKey], unit="ms").max(), - ) - print( - "Timestamp of latest note in data: ", - pd.to_datetime(notes[c.createdAtMillisKey], unit="ms").max(), + if log: + logger.info( + f"Timestamp of latest rating in data: {pd.to_datetime(ratings[c.createdAtMillisKey], unit='ms').max()}", ) + if not ratingsOnly: + logger.info( + f"Timestamp of latest note in data: {pd.to_datetime(notes[c.createdAtMillisKey], unit='ms').max()}", + ) + ratings = remove_duplicate_ratings(ratings) - notes = remove_duplicate_notes(notes) + ratings = compute_helpful_num(ratings) - ratings.loc[:, c.helpfulNumKey] = np.nan - ratings.loc[ratings[c.helpfulKey] == 1, c.helpfulNumKey] = 1 - ratings.loc[ratings[c.notHelpfulKey] == 1, c.helpfulNumKey] = 0 - ratings.loc[ratings[c.helpfulnessLevelKey] == c.notHelpfulValueTsv, c.helpfulNumKey] = 0 - ratings.loc[ratings[c.helpfulnessLevelKey] == c.somewhatHelpfulValueTsv, c.helpfulNumKey] = 0.5 - ratings.loc[ratings[c.helpfulnessLevelKey] == c.helpfulValueTsv, c.helpfulNumKey] = 1 - ratings = ratings.loc[~pd.isna(ratings[c.helpfulNumKey])] + if ratingsOnly: + return pd.DataFrame(), ratings, pd.DataFrame() + + notes = remove_duplicate_notes(notes) notes[c.tweetIdKey] = notes[c.tweetIdKey].astype(str) noteStatusHistory = note_status_history.merge_note_info(noteStatusHistory, notes) if shouldFilterNotMisleadingNotes: - ratings = _filter_misleading_notes(notes, ratings, noteStatusHistory, logging) + ratings = _filter_misleading_notes(notes, ratings, noteStatusHistory, log) - if logging: - print( + if log: + logger.info( "Num Ratings: %d, Num Unique Notes Rated: %d, Num Unique Raters: %d" % ( len(ratings), @@ -354,7 +450,7 @@ def 
filter_ratings( ratings: pd.DataFrame, minNumRatingsPerRater: int, minNumRatersPerNote: int, - logging: bool = True, + log: bool = True, ) -> pd.DataFrame: """Apply min number of ratings for raters & notes. Instead of iterating these filters until convergence, simply stop after going back and force once. @@ -365,7 +461,7 @@ def filter_ratings( included in scoring. Raters with fewer ratings are removed. minNumRatersPerNote: Minimum number of ratings which a note must have to be included in scoring. Notes with fewer ratings are removed. - logging: Debug output. Defaults to True. + log: Debug output. Defaults to True. Returns: pd.DataFrame: filtered ratings @@ -385,11 +481,11 @@ def filter_raters(ratings): ratings = filter_raters(ratings) ratings = filter_notes(ratings) - if logging: + if log: # Log final details unique_notes = ratings[c.noteIdKey].nunique() unique_raters = ratings[c.raterParticipantIdKey].nunique() - print( + logger.info( f"After applying min {minNumRatingsPerRater} ratings per rater and min {minNumRatersPerNote} raters per note: \n" + f"Num Ratings: {len(ratings)}, Num Unique Notes Rated: {unique_notes}, Num Unique Raters: {unique_raters}" ) @@ -400,19 +496,32 @@ def filter_raters(ratings): def write_prescoring_output( prescoringNoteModelOutput: pd.DataFrame, prescoringRaterModelOutput: pd.DataFrame, + noteTopicClassifier: Pipeline, + prescoringMetaOutput: c.PrescoringMetaOutput, + prescoringScoredNotesOutput: Optional[pd.DataFrame], noteModelOutputPath: str, raterModelOutputPath: str, + noteTopicClassifierPath: str, + prescoringMetaOutputPath: str, + prescoringScoredNotesOutputPath: Optional[str], + headers: bool = True, ): prescoringNoteModelOutput = prescoringNoteModelOutput[c.prescoringNoteModelOutputTSVColumns] assert all(prescoringNoteModelOutput.columns == c.prescoringNoteModelOutputTSVColumns) - write_tsv_local(prescoringNoteModelOutput, noteModelOutputPath) + write_tsv_local(prescoringNoteModelOutput, noteModelOutputPath, headers=headers) prescoringRaterModelOutput = prescoringRaterModelOutput[c.prescoringRaterModelOutputTSVColumns] assert all(prescoringRaterModelOutput.columns == c.prescoringRaterModelOutputTSVColumns) - write_tsv_local(prescoringRaterModelOutput, raterModelOutputPath) + write_tsv_local(prescoringRaterModelOutput, raterModelOutputPath, headers=headers) + if prescoringScoredNotesOutput is not None and prescoringScoredNotesOutputPath is not None: + write_tsv_local(prescoringScoredNotesOutput, prescoringScoredNotesOutputPath, headers=headers) -def write_tsv_local(df: pd.DataFrame, path: str) -> None: + joblib.dump(noteTopicClassifier, noteTopicClassifierPath) + joblib.dump(prescoringMetaOutput, prescoringMetaOutputPath) + + +def write_tsv_local(df: pd.DataFrame, path: str, headers: bool = True) -> None: """Write DF as a TSV stored to local disk. 
Note that index=False (so the index column will not be written to disk), and header=True @@ -424,7 +533,7 @@ def write_tsv_local(df: pd.DataFrame, path: str) -> None: """ assert path is not None - assert df.to_csv(path, index=False, header=True, sep="\t") is None + assert df.to_csv(path, index=False, header=headers, sep="\t") is None def write_parquet_local( @@ -476,9 +585,11 @@ def __init__( userEnrollmentPath: str, headers: bool, shouldFilterNotMisleadingNotes: bool = True, - logging: bool = True, + log: bool = True, prescoringNoteModelOutputPath: Optional[str] = None, prescoringRaterModelOutputPath: Optional[str] = None, + prescoringNoteTopicClassifierPath: Optional[str] = None, + prescoringMetaOutputPath: Optional[str] = None, ) -> None: """ Args: @@ -488,7 +599,7 @@ def __init__( userEnrollmentPath (str): file path headers: If true, expect first row of input files to be headers. shouldFilterNotMisleadingNotes (bool, optional): Throw out not-misleading notes if True. Defaults to True. - logging (bool, optional): Print out debug output. Defaults to True. + log (bool, optional): Print out debug output. Defaults to True. """ self.notesPath = notesPath self.ratingsPath = ratingsPath @@ -496,9 +607,11 @@ def __init__( self.userEnrollmentPath = userEnrollmentPath self.prescoringNoteModelOutputPath = prescoringNoteModelOutputPath self.prescoringRaterModelOutputPath = prescoringRaterModelOutputPath + self.prescoringNoteTopicClassifierPath = prescoringNoteTopicClassifierPath + self.prescoringMetaOutputPath = prescoringMetaOutputPath self.headers = headers self.shouldFilterNotMisleadingNotes = shouldFilterNotMisleadingNotes - self.logging = logging + self.log = log def get_data(self) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]: """All-in-one function for reading Birdwatch notes and ratings from TSV files. 
@@ -515,13 +628,15 @@ def get_data(self) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFra self.headers, ) notes, ratings, noteStatusHistory = preprocess_data( - notes, ratings, noteStatusHistory, self.shouldFilterNotMisleadingNotes, self.logging + notes, ratings, noteStatusHistory, self.shouldFilterNotMisleadingNotes, self.log ) return notes, ratings, noteStatusHistory, userEnrollment - def get_prescoring_model_output(self) -> Tuple[pd.DataFrame, pd.DataFrame]: - print( - f"Attempting to read prescoring model output from {self.prescoringNoteModelOutputPath} and {self.prescoringRaterModelOutputPath}" + def get_prescoring_model_output( + self, + ) -> Tuple[pd.DataFrame, pd.DataFrame, Pipeline, c.PrescoringMetaOutput]: + logger.info( + f"Attempting to read prescoring model output from {self.prescoringNoteModelOutputPath}, {self.prescoringRaterModelOutputPath}, {self.prescoringNoteTopicClassifierPath}, {self.prescoringMetaOutputPath}" ) if self.prescoringRaterModelOutputPath is None: prescoringRaterModelOutput = None @@ -530,7 +645,7 @@ def get_prescoring_model_output(self) -> Tuple[pd.DataFrame, pd.DataFrame]: self.prescoringRaterModelOutputPath, c.prescoringRaterModelOutputTSVTypeMapping, c.prescoringRaterModelOutputTSVColumns, - header=True, + header=self.headers, ) assert len(prescoringRaterModelOutput.columns) == len( c.prescoringRaterModelOutputTSVColumns @@ -546,7 +661,7 @@ def get_prescoring_model_output(self) -> Tuple[pd.DataFrame, pd.DataFrame]: self.prescoringNoteModelOutputPath, c.prescoringNoteModelOutputTSVTypeMapping, c.prescoringNoteModelOutputTSVColumns, - header=True, + header=self.headers, ) assert len(prescoringNoteModelOutput.columns) == len( c.prescoringNoteModelOutputTSVColumns @@ -555,4 +670,159 @@ def get_prescoring_model_output(self) -> Tuple[pd.DataFrame, pd.DataFrame]: + f"\n{[col for col in c.prescoringNoteModelOutputTSVColumns if not col in prescoringNoteModelOutput.columns]} are missing." ) # ensure constants file is up to date. - return prescoringNoteModelOutput, prescoringRaterModelOutput + if self.prescoringNoteTopicClassifierPath is None: + prescoringNoteTopicClassifier = None + else: + prescoringNoteTopicClassifier = joblib.load(self.prescoringNoteTopicClassifierPath) + assert type(prescoringNoteTopicClassifier) == Pipeline + + if self.prescoringMetaOutputPath is None: + prescoringMetaOutput = None + else: + prescoringMetaOutput = joblib.load(self.prescoringMetaOutputPath) + assert type(prescoringMetaOutput) == c.PrescoringMetaOutput + + return ( + prescoringNoteModelOutput, + prescoringRaterModelOutput, + prescoringNoteTopicClassifier, + prescoringMetaOutput, + ) + + +def filter_input_data_for_testing( + notes: pd.DataFrame, + ratings: pd.DataFrame, + noteStatusHistory: pd.DataFrame, + cutoffTimestampMillis: Optional[int] = None, + excludeRatingsAfterANoteGotFirstStatusPlusNHours: Optional[int] = None, + daysInPastToApplyPostFirstStatusFiltering: Optional[int] = 14, + filterPrescoringInputToSimulateDelayInHours: Optional[int] = None, +) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]: + """ + Args: + cutoffTimestampMillis: filter all notes and ratings after this time. + + excludeRatingsAfterANoteGotFirstStatusPlusNHours: set to 0 to throw out all + ratings after a note was first CRH. Set to None to turn off. + daysInPastToApplyPostFirstStatusFiltering: only apply the previous + filter to notes created in the last this-many days. 
+ + filterPrescoringInputToSimulateDelayInHours: Optional[int]: for system tests, + simulate final scoring running this many hours after prescoring. + + Returns: notes, ratings, prescoringNotesInput, prescoringRatingsInput + """ + logger.info( + f"""Called filter_input_data_for_testing. + Notes: {len(notes)}, Ratings: {len(ratings)}. Max note createdAt: {pd.to_datetime(notes[c.createdAtMillisKey].max(), unit='ms')}; Max rating createAt: {pd.to_datetime(ratings[c.createdAtMillisKey].max(), unit='ms')}""" + ) + + notes, ratings = filter_notes_and_ratings_after_particular_timestamp_millis( + notes, ratings, cutoffTimestampMillis + ) + logger.info( + f"""After filtering notes and ratings after particular timestamp (={cutoffTimestampMillis}). + Notes: {len(notes)}, Ratings: {len(ratings)}. Max note createdAt: {pd.to_datetime(notes[c.createdAtMillisKey].max(), unit='ms')}; Max rating createAt: {pd.to_datetime(ratings[c.createdAtMillisKey].max(), unit='ms')}""" + ) + + ratings = filter_ratings_after_first_status_plus_n_hours( + ratings, + noteStatusHistory, + excludeRatingsAfterANoteGotFirstStatusPlusNHours, + daysInPastToApplyPostFirstStatusFiltering, + ) + logger.info( + f"""After filtering ratings after first status (plus {excludeRatingsAfterANoteGotFirstStatusPlusNHours} hours) for notes created in last {daysInPastToApplyPostFirstStatusFiltering} days. + Notes: {len(notes)}, Ratings: {len(ratings)}. Max note createdAt: {pd.to_datetime(notes[c.createdAtMillisKey].max(), unit='ms')}; Max rating createAt: {pd.to_datetime(ratings[c.createdAtMillisKey].max(), unit='ms')}""" + ) + + ( + prescoringNotesInput, + prescoringRatingsInput, + ) = filter_prescoring_input_to_simulate_delay_in_hours( + notes, ratings, filterPrescoringInputToSimulateDelayInHours + ) + logger.info( + f"""After filtering prescoring notes and ratings to simulate a delay of {filterPrescoringInputToSimulateDelayInHours} hours: + Notes: {len(prescoringNotesInput)}, Ratings: {len(prescoringRatingsInput)}. 
Max note createdAt: {pd.to_datetime(prescoringNotesInput[c.createdAtMillisKey].max(), unit='ms')}; Max rating createAt: {pd.to_datetime(prescoringRatingsInput[c.createdAtMillisKey].max(), unit='ms')}""" + ) + + return notes, ratings, prescoringNotesInput, prescoringRatingsInput + + +def filter_ratings_after_first_status_plus_n_hours( + ratings: pd.DataFrame, + noteStatusHistory: pd.DataFrame, + excludeRatingsAfterANoteGotFirstStatusPlusNHours: Optional[int] = None, + daysInPastToApplyPostFirstStatusFiltering: Optional[int] = 14, +) -> pd.DataFrame: + if excludeRatingsAfterANoteGotFirstStatusPlusNHours is None: + return ratings + + if daysInPastToApplyPostFirstStatusFiltering is None: + daysInPastToApplyPostFirstStatusFiltering = 14 + + ratingCutoffTimeMillisKey = "ratingCutoffTimeMillis" + + # First: determine out which notes to apply this to (created in past + # daysInPastToApplyPostFirstStatusFiltering days) + millisToLookBack = daysInPastToApplyPostFirstStatusFiltering * 24 * 60 * 60 * 1000 + cutoffTimeMillis = noteStatusHistory[c.createdAtMillisKey].max() - millisToLookBack + nshToFilter = noteStatusHistory[noteStatusHistory[c.createdAtMillisKey] > cutoffTimeMillis] + logger.info( + f" Notes to apply the post-first-status filter for (from last {daysInPastToApplyPostFirstStatusFiltering} days): {len(nshToFilter)}" + ) + nshToFilter[ratingCutoffTimeMillisKey] = nshToFilter[ + c.timestampMillisOfNoteFirstNonNMRLabelKey + ] + (excludeRatingsAfterANoteGotFirstStatusPlusNHours * 60 * 60 * 1000) + + # Next: join their firstStatusTime from NSH with their ratings + ratingsWithNSH = ratings.merge( + nshToFilter[[c.noteIdKey, ratingCutoffTimeMillisKey]], on=c.noteIdKey, how="left" + ) + # And then filter out ratings made after that time. Don't filter any ratings for notes with + # nan cutoff time. + ratingsWithNSH[ratingCutoffTimeMillisKey].fillna( + ratingsWithNSH[c.createdAtMillisKey].max() + 1, inplace=True + ) + ratingsWithNSH = ratingsWithNSH[ + ratingsWithNSH[c.createdAtMillisKey] < ratingsWithNSH[ratingCutoffTimeMillisKey] + ] + return ratingsWithNSH.drop(columns=[ratingCutoffTimeMillisKey]) + + +def filter_notes_and_ratings_after_particular_timestamp_millis( + notes: pd.DataFrame, + ratings: pd.DataFrame, + cutoffTimestampMillis: Optional[int], +) -> Tuple[pd.DataFrame, pd.DataFrame]: + if cutoffTimestampMillis is not None: + notes = notes[notes[c.createdAtMillisKey] <= cutoffTimestampMillis].copy() + ratings = ratings[ratings[c.createdAtMillisKey] <= cutoffTimestampMillis].copy() + return notes, ratings + + +def filter_prescoring_input_to_simulate_delay_in_hours( + notes: pd.DataFrame, + ratings: pd.DataFrame, + filterPrescoringInputToSimulateDelayInHours: Optional[int], +) -> Tuple[pd.DataFrame, pd.DataFrame]: + if filterPrescoringInputToSimulateDelayInHours is not None: + latestRatingMillis = ratings[c.createdAtMillisKey].max() + cutoffMillis = latestRatingMillis - ( + filterPrescoringInputToSimulateDelayInHours * 60 * 60 * 1000 + ) + logger.info( + f""" + Filtering input data for prescoring to simulate running prescoring earlier than final scoring. 
+ Latest rating timestamp: {pd.to_datetime(latestRatingMillis, unit='ms')} + Cutoff timestamp: {pd.to_datetime(cutoffMillis, unit='ms')} ({filterPrescoringInputToSimulateDelayInHours} hours before) + """ + ) + prescoringNotesInput = notes[notes[c.createdAtMillisKey] < cutoffMillis].copy() + prescoringRatingsInput = ratings[ratings[c.createdAtMillisKey] < cutoffMillis].copy() + else: + prescoringNotesInput = notes + prescoringRatingsInput = ratings + return prescoringNotesInput, prescoringRatingsInput diff --git a/sourcecode/scoring/reputation_matrix_factorization/dataset.py b/sourcecode/scoring/reputation_matrix_factorization/dataset.py index 1f083251..392b3605 100644 --- a/sourcecode/scoring/reputation_matrix_factorization/dataset.py +++ b/sourcecode/scoring/reputation_matrix_factorization/dataset.py @@ -1,4 +1,5 @@ from dataclasses import dataclass +from typing import Dict from .. import constants as c @@ -14,8 +15,11 @@ class MatrixFactorizationDataset: raterTensor: torch.Tensor targetTensor: torch.Tensor # Ordered notes and raters associated with each index - notes: np.ndarray - raters: np.ndarray + notes: np.ndarray # noteIds # idx -> id + raters: np.ndarray # raterIds # idx -> id + # Maps of id to index + raterIdToIndex: Dict #: Dict[int, int] + noteIdToIndex: Dict #: Dict[int, int] def build_dataset( @@ -32,15 +36,19 @@ def build_dataset( """ # Identify mappings from note and rater IDs to indices notes = ratings[c.noteIdKey].drop_duplicates().sort_values().values - noteIdMap = dict(zip(notes, np.arange(len(notes), dtype=np.int64))) + noteIdToIndex = dict(zip(notes, np.arange(len(notes), dtype=np.int32))) raters = ratings[c.raterParticipantIdKey].drop_duplicates().sort_values().values - raterIdMap = dict(zip(raters, np.arange(len(raters), dtype=np.int64))) + raterIdToIndex = dict(zip(raters, np.arange(len(raters), dtype=np.int32))) # Generate tensors - noteTensor = torch.tensor([noteIdMap[noteId] for noteId in ratings[c.noteIdKey]], device=device) - raterTensor = torch.tensor( - [raterIdMap[raterId] for raterId in ratings[c.raterParticipantIdKey]], device=device + noteTensor = torch.IntTensor( + [noteIdToIndex[noteId] for noteId in ratings[c.noteIdKey]], device=device + ) + raterTensor = torch.IntTensor( + [raterIdToIndex[raterId] for raterId in ratings[c.raterParticipantIdKey]], + device=device, ) targetTensor = torch.tensor(targets, device=device, dtype=torch.float32) + # Return MatrixFactorizationDataset return MatrixFactorizationDataset( noteTensor=noteTensor, @@ -48,4 +56,6 @@ def build_dataset( targetTensor=targetTensor, notes=notes, raters=raters, + raterIdToIndex=raterIdToIndex, + noteIdToIndex=noteIdToIndex, ) diff --git a/sourcecode/scoring/reputation_matrix_factorization/diligence_model.py b/sourcecode/scoring/reputation_matrix_factorization/diligence_model.py index 6ff9ce8a..52f6f247 100644 --- a/sourcecode/scoring/reputation_matrix_factorization/diligence_model.py +++ b/sourcecode/scoring/reputation_matrix_factorization/diligence_model.py @@ -1,19 +1,28 @@ -from typing import Optional +import logging +from typing import Optional, Tuple from .. 
import constants as c from .dataset import build_dataset -from .reputation_matrix_factorization import ReputationModelHyperparameters, train_model +from .reputation_matrix_factorization import ( + ReputationModelHyperparameters, + train_model_final, + train_model_prescoring, +) import pandas as pd import torch -def get_low_diligence_intercepts( +logger = logging.getLogger("birdwatch.diligence_model") +logger.setLevel(logging.INFO) + + +def _setup_dataset_and_hparams( filteredRatings: pd.DataFrame, - noteInitState: Optional[pd.DataFrame] = None, - raterInitState: Optional[pd.DataFrame] = None, device=torch.device("cpu"), -) -> pd.DataFrame: + ratingsPerNoteLossRatio: Optional[float] = None, + ratingsPerUserLossRatio: Optional[float] = None, +): # Define dataset targets = ( ( @@ -30,7 +39,7 @@ def get_low_diligence_intercepts( # Model hyperparameters activationFunction="IDENTITY", nDim=1, - # Optimizaiton hyperparameters + # Optimization hyperparameters numEpochs=300, logRate=30, learningRate=0.2, @@ -59,22 +68,133 @@ def get_low_diligence_intercepts( raterNormExpThirdRound=0, reputationExp=0.5, alpha=0.1, + defaultReputation=1.0, + ratingPerNoteLossRatio=ratingsPerNoteLossRatio, # 35.0, # approx 29377568 / 795977 + ratingPerUserLossRatio=ratingsPerUserLossRatio, # 75.0, # approx 29377568 / 265214 + ) + return dataset, hParams + + +def _prepare_diligence_init_state(noteInitState, raterInitState): + if noteInitState is not None: + noteInitState = noteInitState[ + [c.noteIdKey] + [col for col in noteInitState.columns if "lowDiligence" in col] + ] + noteInitState.columns = [ + col.replace("lowDiligence", "internal") for col in noteInitState.columns + ] + if raterInitState is not None: + raterInitState = raterInitState[ + [c.raterParticipantIdKey] + [col for col in raterInitState.columns if "lowDiligence" in col] + ] + raterInitState.columns = [ + col.replace("lowDiligence", "internal") for col in raterInitState.columns + ] + return noteInitState, raterInitState + + +def fit_low_diligence_model_final( + filteredRatings: pd.DataFrame, + noteInitStateDiligence: pd.DataFrame, + raterInitStateDiligence: pd.DataFrame, + globalInterceptDiligence: c.ReputationGlobalIntercept, + ratingsPerNoteLossRatio: Optional[float] = None, + ratingsPerUserLossRatio: Optional[float] = None, + device=torch.device("cpu"), +) -> Tuple[pd.DataFrame, pd.DataFrame]: + """ + Args: + filteredRatings: DataFrame containing ratings data + noteInitStateDiligence: DataFrame containing initial state for notes (expects diligence prefixes e.g. 
lowDiligenceNoteIntercept) + raterInitStateDiligence: DataFrame containing initial state for raters (expects diligence prefixes, not internal prefixes) + globalInterceptDiligence: float + device: torch.device to use for training + """ + dataset, hParams = _setup_dataset_and_hparams( + filteredRatings, device, ratingsPerNoteLossRatio, ratingsPerUserLossRatio + ) + noteInitStateInternal, raterInitStateInternal = _prepare_diligence_init_state( + noteInitStateDiligence, raterInitStateDiligence + ) + + model, loss_final = train_model_final( + hParams=hParams, + dataset=dataset, + noteInitState=noteInitStateInternal, + raterInitState=raterInitStateInternal, + globalInterceptInit=globalInterceptDiligence, + device=device, + ) + logger.info(f"Low diligence final loss: {loss_final:.4f}") + + noteStats = pd.DataFrame( + { + c.noteIdKey: dataset.notes, + c.lowDiligenceNoteInterceptKey: model.noteBias.weight.cpu().flatten().detach().numpy(), + c.lowDiligenceNoteFactor1Key: model.noteEmbedding.weight.cpu().flatten().detach().numpy(), + } + ) + raterStats = pd.DataFrame( + { + c.raterParticipantIdKey: dataset.raters, + c.lowDiligenceRaterInterceptKey: model.raterBias.weight.cpu().flatten().detach().numpy(), + c.lowDiligenceRaterReputationKey: model.raterReputation.weight.cpu() + .flatten() + .detach() + .numpy(), + c.lowDiligenceRaterFactor1Key: model.raterEmbedding.weight.cpu().flatten().detach().numpy(), + } + ) + return noteStats, raterStats + + +def fit_low_diligence_model_prescoring( + filteredRatings: pd.DataFrame, + noteInitStateDiligence: Optional[pd.DataFrame] = None, + raterInitStateDiligence: Optional[pd.DataFrame] = None, + device=torch.device("cpu"), +) -> Tuple[pd.DataFrame, pd.DataFrame, c.ReputationGlobalIntercept]: + dataset, hParams = _setup_dataset_and_hparams(filteredRatings, device) + noteInitStateInternal, raterInitStateInternal = _prepare_diligence_init_state( + noteInitStateDiligence, raterInitStateDiligence ) # Train model - model, loss1, loss2, loss3 = train_model( + ( + model, + loss1, + loss2, + loss3, + globalIntercept, + noteIntercept2, + raterIntercept2, + ) = train_model_prescoring( hParams=hParams, dataset=dataset, - noteInitState=noteInitState, - raterInitState=raterInitState, + noteInitState=noteInitStateInternal, + raterInitState=raterInitStateInternal, device=device, ) - print(f"Low diligence training loss: {loss1:.4f}, {loss2:.4f}, {loss3:.4f}") + logger.info(f"Low diligence training loss: {loss1:.4f}, {loss2:.4f}, {loss3:.4f}") - # Compose and return DataFrame - return pd.DataFrame( + noteStats = pd.DataFrame( { c.noteIdKey: dataset.notes, - c.lowDiligenceInterceptKey: model.noteBias.weight.cpu().flatten().detach().numpy(), + c.lowDiligenceNoteInterceptKey: model.noteBias.weight.cpu().flatten().detach().numpy(), + c.lowDiligenceNoteFactor1Key: model.noteEmbedding.weight.cpu().flatten().detach().numpy(), + c.lowDiligenceNoteInterceptRound2Key: noteIntercept2, + } + ) + raterStats = pd.DataFrame( + { + c.raterParticipantIdKey: dataset.raters, + c.lowDiligenceRaterInterceptKey: model.raterBias.weight.cpu().flatten().detach().numpy(), + c.lowDiligenceRaterReputationKey: model.raterReputation.weight.cpu() + .flatten() + .detach() + .numpy(), + c.lowDiligenceRaterFactor1Key: model.raterEmbedding.weight.cpu().flatten().detach().numpy(), + c.lowDiligenceRaterInterceptRound2Key: raterIntercept2, } ) + return noteStats, raterStats, globalIntercept diff --git a/sourcecode/scoring/reputation_matrix_factorization/helpfulness_model.py 
b/sourcecode/scoring/reputation_matrix_factorization/helpfulness_model.py index 1f4bc67d..f19f4453 100644 --- a/sourcecode/scoring/reputation_matrix_factorization/helpfulness_model.py +++ b/sourcecode/scoring/reputation_matrix_factorization/helpfulness_model.py @@ -1,19 +1,26 @@ -from typing import Optional +import logging +from typing import Optional, Tuple from .. import constants as c from .dataset import build_dataset -from .reputation_matrix_factorization import ReputationModelHyperparameters, train_model +from .reputation_matrix_factorization import ( + ReputationModelHyperparameters, + train_model_final, + train_model_prescoring, +) import pandas as pd import torch -def get_helpfulness_reputation_results( +logger = logging.getLogger("birdwatch.helpfulness_model") +logger.setLevel(logging.INFO) + + +def _setup_dataset_and_hparams( filteredRatings: pd.DataFrame, - noteInitState: Optional[pd.DataFrame] = None, - raterInitState: Optional[pd.DataFrame] = None, device=torch.device("cpu"), -) -> pd.DataFrame: +): # Define dataset targets = filteredRatings[c.helpfulNumKey].values dataset = build_dataset(filteredRatings, targets, device=device) @@ -22,7 +29,7 @@ def get_helpfulness_reputation_results( # Model hyperparameters activationFunction="IDENTITY", nDim=1, - # Optimizaiton hyperparameters + # Optimization hyperparameters numEpochs=300, logRate=30, learningRate=0.2, @@ -52,16 +59,38 @@ def get_helpfulness_reputation_results( reputationExp=1.0, alpha=0.0, ) + return dataset, hParams + + +def get_helpfulness_reputation_results_final( + filteredRatings: pd.DataFrame, + noteInitState: pd.DataFrame, + raterInitState: pd.DataFrame, + globalIntercept: c.ReputationGlobalIntercept, + device=torch.device("cpu"), +) -> Tuple[pd.DataFrame, pd.DataFrame]: + dataset, hParams = _setup_dataset_and_hparams(filteredRatings, device) + + # Hack: convert "diligenceRound2" to internal round 2, since the diligence fields are used as placeholders + # for the helpfulness-reputation model's internal round 2 score, since this model is not used as a + # prod scorer now and doesn't run its own diligence model. 
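+ # Copy the round-2 intercepts from the lowDiligence* placeholder columns into the corresponding internal* columns before initializing the final model.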
+ noteInitState[c.internalNoteInterceptRound2Key] = noteInitState[ + c.lowDiligenceNoteInterceptRound2Key + ] + raterInitState[c.internalRaterInterceptRound2Key] = raterInitState[ + c.lowDiligenceRaterInterceptRound2Key + ] # Train model - model, loss1, loss2, loss3 = train_model( + model, loss = train_model_final( hParams=hParams, dataset=dataset, noteInitState=noteInitState, raterInitState=raterInitState, + globalInterceptInit=globalIntercept, device=device, ) - print(f"Helpfulness reputation loss: {loss1:.4f}, {loss2:.4f}, {loss3:.4f}") + logger.info(f"Helpfulness reputation loss: {loss:.4f}") # Compose and return DataFrames noteStats = pd.DataFrame( @@ -81,3 +110,54 @@ def get_helpfulness_reputation_results( } ) return noteStats, raterStats + + +def get_helpfulness_reputation_results_prescoring( + filteredRatings: pd.DataFrame, + noteInitState: Optional[pd.DataFrame] = None, + raterInitState: Optional[pd.DataFrame] = None, + device=torch.device("cpu"), +) -> Tuple[pd.DataFrame, pd.DataFrame, c.ReputationGlobalIntercept]: + dataset, hParams = _setup_dataset_and_hparams(filteredRatings, device) + + # Train model + ( + model, + loss1, + loss2, + loss3, + globalIntercept, + noteIntercept2, + raterIntercept2, + ) = train_model_prescoring( + hParams=hParams, + dataset=dataset, + noteInitState=noteInitState, + raterInitState=raterInitState, + device=device, + ) + logger.info(f"Helpfulness reputation loss: {loss1:.4f}, {loss2:.4f}, {loss3:.4f}") + + # Compose and return DataFrames + noteStats = pd.DataFrame( + { + c.noteIdKey: dataset.notes, + c.internalNoteInterceptKey: model.noteBias.weight.cpu().flatten().detach().numpy(), + c.internalNoteFactor1Key: model.noteEmbedding.weight.cpu().flatten().detach().numpy(), + # Hack for now: not actually diligence, but it's the 2nd round intercept from helpfulness. + # TODO: make a new top-level field for 2nd round reputation model intercepts in top-level prescoring output. + c.lowDiligenceNoteInterceptRound2Key: noteIntercept2, + } + ) + raterStats = pd.DataFrame( + { + c.raterParticipantIdKey: dataset.raters, + c.internalRaterReputationKey: model.raterReputation.weight.cpu().flatten().detach().numpy(), + c.internalRaterInterceptKey: model.raterBias.weight.cpu().flatten().detach().numpy(), + c.internalRaterFactor1Key: model.raterEmbedding.weight.cpu().flatten().detach().numpy(), + # Hack for now: not actually diligence, but it's the 2nd round intercept from helpfulness. + # TODO: make a new top-level field for 2nd round reputation model intercepts in top-level prescoring output. 
+ c.lowDiligenceRaterInterceptRound2Key: raterIntercept2, + } + ) + return noteStats, raterStats, globalIntercept diff --git a/sourcecode/scoring/reputation_matrix_factorization/reputation_matrix_factorization.py b/sourcecode/scoring/reputation_matrix_factorization/reputation_matrix_factorization.py index ef5526f1..6e9a8617 100644 --- a/sourcecode/scoring/reputation_matrix_factorization/reputation_matrix_factorization.py +++ b/sourcecode/scoring/reputation_matrix_factorization/reputation_matrix_factorization.py @@ -1,4 +1,5 @@ from dataclasses import dataclass +import logging import time from typing import Optional @@ -11,6 +12,10 @@ import torch.nn as nn +logger = logging.getLogger("birdwatch.reputation_matrix_factorization") +logger.setLevel(logging.INFO) + + # Define dataclass to represent learning hyperparameters @dataclass class ReputationModelHyperparameters: @@ -46,6 +51,18 @@ class ReputationModelHyperparameters: raterNormExpThirdRound: float reputationExp: float alpha: float + defaultReputation: float = 1.0 + ratingPerNoteLossRatio: Optional[float] = None + ratingPerUserLossRatio: Optional[float] = None + + +def get_or_default_if_nan(lookupDict, key, default): + if key not in lookupDict: + return default + val = lookupDict.get(key, default) + if np.isnan(val): + return default + return val # Define model with customizable loss, activation, regularization and dimensionality @@ -62,8 +79,20 @@ def __init__( l2RaterReputationMultiplier, noteInitState: Optional[pd.DataFrame] = None, raterInitState: Optional[pd.DataFrame] = None, + globalInterceptInit: Optional[float] = None, device=torch.device("cpu"), + defaultReputation=1.0, + ratingPerNoteLossRatio: Optional[float] = None, + ratingPerUserLossRatio: Optional[float] = None, ): + """ + noteInitState expects a df with columns: + noteId, internalNoteIntercept, internalRaterFactor1 + raterInitState expects a df with columns: + raterParticipantIdKey, internalRaterIntercept, internalRaterFactor1, internalRaterReputation + + For diligence model: may want to map these names back to internal before calling this function. 
+ """ super().__init__() # Save hyperparameters self.activation_fn = activation_fn @@ -80,35 +109,115 @@ def __init__( self.noteBias = nn.Embedding(dataset.notes.shape[0], 1, **self.format) self.raterBias = nn.Embedding(dataset.raters.shape[0], 1, **self.format) self.raterReputation = nn.Embedding(dataset.raters.shape[0], 1, **self.format) - self.globalBias = nn.Parameter(torch.tensor(0.0, **self.format)) - # Initialize rater reputation to 1 self.raterReputation.weight = nn.Parameter( - torch.ones(self.raterReputation.weight.shape[0], 1, **self.format) + torch.ones(self.raterReputation.weight.shape[0], 1, **self.format) * defaultReputation ) - if raterInitState is not None: - mapping = dict(raterInitState[[c.raterParticipantIdKey, c.internalRaterFactor1Key]].values) - print("Initializing raters:") - print(f" num raters: {dataset.raters.shape[0]}") - self.raterEmbedding.weight = nn.Parameter( - torch.tensor([mapping.get(rater, 0.0) for rater in dataset.raters]) - .to(torch.float32) - .reshape(-1, 1) - .to(device) + self._ratingPerNoteLossRatio = ratingPerNoteLossRatio + self._ratingPerUserLossRatio = ratingPerUserLossRatio + + self.init_global_bias(globalInterceptInit) + self.init_rater_factor(raterInitState, dataset, device, defaultValue=0.0) + self.init_rater_intercept(raterInitState, dataset, device, defaultValue=0.0) + self.init_rater_reputation(raterInitState, dataset, device, defaultValue=defaultReputation) + self.init_note_factor(noteInitState, dataset, device, defaultValue=0.0) + self.init_note_intercept(noteInitState, dataset, device, defaultValue=0.0) + + def init_parameter(self, initDf, initCol, idKey, ratersOrNotes, device, defaultValue): + if initDf is not None and initCol in initDf.columns: + idToInitValue = dict(initDf[[idKey, initCol]].values) + logger.info(f"Initializing {initCol}:") + logger.info( + f" num in dataset: {ratersOrNotes.shape[0]}, vs. 
num we are initializing: {len(initDf)}" ) - print(f" uninitialized raters: {(self.raterEmbedding.weight == 0).flatten().sum()}") - print(f" initialized raters: {(self.raterEmbedding.weight != 0).flatten().sum()}") - if noteInitState is not None: - print("Initializing notes:") - print(f" num notes: {dataset.notes.shape[0]}") - mapping = dict(noteInitState[[c.noteIdKey, c.internalNoteFactor1Key]].values) - self.noteEmbedding.weight = nn.Parameter( - torch.tensor([mapping.get(note, 0.0) for note in dataset.notes]) - .reshape(-1, 1) + paramWeightToInit = nn.Parameter( + torch.tensor( + [ + get_or_default_if_nan(lookupDict=idToInitValue, key=raterOrNoteId, default=defaultValue) + for raterOrNoteId in ratersOrNotes + ] + ) .to(torch.float32) + .reshape(-1, 1) .to(device) ) - print(f" uninitialized notes: {(self.noteEmbedding.weight == 0).flatten().sum()}") - print(f" initialized notes: {(self.noteEmbedding.weight != 0).flatten().sum()}") + logger.info(f" uninitialized {initCol}s: {(paramWeightToInit == 0).flatten().sum()}") + logger.info(f" initialized {initCol}s: {(paramWeightToInit != 0).flatten().sum()}") + return paramWeightToInit + else: + logger.info(f"Not initializing {initCol}") + return None + + def init_note_factor(self, noteInitState, dataset, device, defaultValue=0): + initVal = self.init_parameter( + initDf=noteInitState, + initCol=c.internalNoteFactor1Key, + idKey=c.noteIdKey, + ratersOrNotes=dataset.notes, + device=device, + defaultValue=defaultValue, + ) + if initVal is not None: + self.noteEmbedding.weight = initVal + assert not torch.isnan(self.noteEmbedding.weight).any() + + def init_note_intercept(self, noteInitState, dataset, device, defaultValue=0): + initVal = self.init_parameter( + initDf=noteInitState, + initCol=c.internalNoteInterceptKey, + idKey=c.noteIdKey, + ratersOrNotes=dataset.notes, + device=device, + defaultValue=defaultValue, + ) + if initVal is not None: + self.noteBias.weight = initVal + assert not torch.isnan(self.noteBias.weight).any() + + def init_rater_factor(self, raterInitState, dataset, device, defaultValue=0): + initVal = self.init_parameter( + initDf=raterInitState, + initCol=c.internalRaterFactor1Key, + idKey=c.raterParticipantIdKey, + ratersOrNotes=dataset.raters, + device=device, + defaultValue=defaultValue, + ) + if initVal is not None: + self.raterEmbedding.weight = initVal + assert not torch.isnan(self.raterEmbedding.weight).any() + + def init_rater_reputation(self, raterInitState, dataset, device, defaultValue): + initVal = self.init_parameter( + initDf=raterInitState, + initCol=c.internalRaterReputationKey, + idKey=c.raterParticipantIdKey, + ratersOrNotes=dataset.raters, + device=device, + defaultValue=defaultValue, + ) + if initVal is not None: + self.raterReputation.weight = initVal + assert not torch.isnan(self.raterReputation.weight).any() + + def init_rater_intercept(self, raterInitState, dataset, device, defaultValue=0): + initVal = self.init_parameter( + initDf=raterInitState, + initCol=c.internalRaterInterceptKey, + idKey=c.raterParticipantIdKey, + ratersOrNotes=dataset.raters, + device=device, + defaultValue=defaultValue, + ) + if initVal is not None: + self.raterBias.weight = initVal + assert not torch.isnan(self.raterBias.weight).any() + + def init_global_bias(self, globalInterceptInit): + if globalInterceptInit is not None: + self.globalBias = nn.Parameter(torch.tensor(globalInterceptInit, **self.format)) + else: + self.globalBias = nn.Parameter(torch.tensor(0.0, **self.format)) + assert not torch.isnan(self.globalBias).any() def 
forward(self, notes, raters): pred = (self.noteEmbedding(notes) * self.raterEmbedding(raters)).sum( @@ -118,15 +227,52 @@ def forward(self, notes, raters): pred += self.raterBias(raters) + self.globalBias return self.activation_fn(pred) - def get_regularization_loss(self): - regularizationLoss = ( - (self.l2Lambda * (self.noteEmbedding.weight**2).mean()) - + (self.l2Lambda * (self.raterEmbedding.weight**2).mean()) - + (self.l2Lambda * self.l2NoteBiasMultiplier * (self.noteBias.weight**2).mean()) - + (self.l2Lambda * self.l2RaterBiasMultiplier * (self.raterBias.weight**2).mean()) - + (self.l2Lambda * self.l2RaterReputationMultiplier * (self.raterReputation.weight**2).mean()) - + (self.l2Lambda * self.l2GlobalBiasMultiplier * (self.globalBias**2)) - ) + def get_regularization_loss(self, numRatings): + regularizationLoss = self.l2Lambda * self.l2GlobalBiasMultiplier * (self.globalBias**2) + + if self._ratingPerNoteLossRatio is None: + regularizationLoss += self.l2Lambda * (self.noteEmbedding.weight**2).mean() + regularizationLoss += ( + self.l2Lambda * self.l2NoteBiasMultiplier * (self.noteBias.weight**2).mean() + ) + else: + simulatedNumberOfNotesForLoss = numRatings / self._ratingPerNoteLossRatio + regularizationLoss += ( + self.l2Lambda * (self.noteEmbedding.weight**2).sum() / simulatedNumberOfNotesForLoss + ) + regularizationLoss += ( + self.l2Lambda + * self.l2NoteBiasMultiplier + * (self.noteBias.weight**2).sum() + / simulatedNumberOfNotesForLoss + ) + + if self._ratingPerUserLossRatio is None: + regularizationLoss += self.l2Lambda * (self.raterEmbedding.weight**2).mean() + regularizationLoss += ( + self.l2Lambda * self.l2RaterBiasMultiplier * (self.raterBias.weight**2).mean() + ) + regularizationLoss += ( + self.l2Lambda * self.l2RaterReputationMultiplier * (self.raterReputation.weight**2).mean() + ) + else: + simulatedNumberOfRatersForLoss = numRatings / self._ratingPerUserLossRatio + regularizationLoss += ( + self.l2Lambda * (self.raterEmbedding.weight**2).sum() / simulatedNumberOfRatersForLoss + ) + regularizationLoss += ( + self.l2Lambda + * self.l2RaterBiasMultiplier + * (self.raterBias.weight**2).sum() + / simulatedNumberOfRatersForLoss + ) + regularizationLoss += ( + self.l2Lambda + * self.l2RaterReputationMultiplier + * (self.raterReputation.weight**2).sum() + / simulatedNumberOfRatersForLoss + ) + return regularizationLoss @@ -135,6 +281,7 @@ def _train_one_round(model, loss_fn, dataset, hParams): # Identify tensors for training and testing notes = dataset.noteTensor raters = dataset.raterTensor + numRatings = dataset.raterTensor.shape[0] # Initilaize training state optim = torch.optim.Adam(model.parameters(), lr=hParams.learningRate) epoch = 0 @@ -147,13 +294,16 @@ def _train_one_round(model, loss_fn, dataset, hParams): pred = model(notes, raters) # Compute loss loss = loss_fn(pred.flatten()) - loss += model.get_regularization_loss() + loss += model.get_regularization_loss(numRatings) + assert not torch.isnan(loss).any() if hParams.logRate and epoch % hParams.logRate == 0: - print(f"epoch={epoch:03d} | loss={loss.item():7.4f} | time={time.time() - start:.1f}s") + logger.info(f"epoch={epoch:03d} | loss={loss.item():7.6f} | time={time.time() - start:.1f}s") if hParams.convergence > 0 and epoch % hParams.stablePeriod == 0: if priorLoss is not None and (priorLoss - loss).abs() < hParams.convergence: if hParams.logRate: - print(f"epoch={epoch:03d} | loss={loss.item():7.4f} | time={time.time() - start:.1f}s") + logger.info( + f"epoch={epoch:03d} | loss={loss.item():7.6f} | 
time={time.time() - start:.1f}s" + ) break priorLoss = loss # Perform backward pass @@ -170,19 +320,13 @@ def _sigmoid_range(low, high): return lambda tensor: sigmoid_fn(tensor) * (high - low) + low -# TODO: replace string constants with enums -def train_model( - hParams, - dataset, - noteInitState: Optional[pd.DataFrame] = None, - raterInitState: Optional[pd.DataFrame] = None, - device=torch.device("cpu"), +def _setup_model( + dataset, # MatrixFactorizationDataset, + hParams: ReputationModelHyperparameters, + noteInitState: pd.DataFrame, + raterInitState: pd.DataFrame, + globalInterceptInit: Optional[float] = None, ): - # Unpack dataset - notes = dataset.noteTensor - raters = dataset.raterTensor - targets = dataset.targetTensor - # Define model activation_fn = None if hParams.activationFunction == "SIGMOID": @@ -199,6 +343,9 @@ def train_model( assert hParams.lossFunction == "BCEWithLogitsLoss" loss_fn = nn.BCEWithLogitsLoss(reduction="none") + logger.info( + f"Setup model: noteInitState: \n{noteInitState},\n raterInitState: \n{raterInitState}" + ) model = ReputationMFModel( dataset, activation_fn=activation_fn, @@ -210,16 +357,31 @@ def train_model( l2RaterReputationMultiplier=hParams.l2RaterReputationMultiplier, noteInitState=noteInitState, raterInitState=raterInitState, + globalInterceptInit=globalInterceptInit, + defaultReputation=hParams.defaultReputation, + ratingPerNoteLossRatio=hParams.ratingPerNoteLossRatio, + ratingPerUserLossRatio=hParams.ratingPerUserLossRatio, ) + return model, loss_fn + + +# TODO: replace string constants with enums +def train_model_prescoring( + hParams, + dataset, + noteInitState: Optional[pd.DataFrame] = None, + raterInitState: Optional[pd.DataFrame] = None, + device=torch.device("cpu"), +): + model, loss_fn = _setup_model(dataset, hParams, noteInitState, raterInitState) - # train round 1 - print("Reputation Matrix Factorization:") - print("Round 1:") + logger.info("Reputation Matrix Factorization: rater reputation frozen") + logger.info("Round 1:") loss_fn_1 = WeightedLoss( loss_fn, - notes, - raters, - targets, + dataset.noteTensor, + dataset.raterTensor, + dataset.targetTensor, posWeight=hParams.posWeight, noteNormExp=hParams.noteNormExpFirstRound, raterNormExp=hParams.raterNormExpFirstRound, @@ -227,14 +389,15 @@ def train_model( ) model.raterReputation.requires_grad_(False) loss1 = _train_one_round(model, loss_fn_1, dataset, hParams) + logger.info(f"After round 1, global bias: {model.globalBias}") + globalInt1 = model.globalBias.data.cpu().detach().numpy().item() - # train round 2 - print("\nRound 2:") + logger.info("\nRound 2: learn rater rep (and everything else), freeze note intercept") loss_fn_2 = WeightedLoss( loss_fn, - notes, - raters, - targets, + dataset.noteTensor, + dataset.raterTensor, + dataset.targetTensor, posWeight=hParams.posWeight * hParams.posWeightSecondRoundMultiplier, noteNormExp=hParams.noteNormExpSecondRound, raterNormExp=hParams.raterNormExpSecondRound, @@ -242,22 +405,26 @@ def train_model( ) model.raterReputation.requires_grad_(True) model.noteBias.requires_grad_(False) + loss2 = _train_one_round(model, loss_fn_2, dataset, hParams) + globalInt2 = model.globalBias.data.cpu().detach().numpy().item() + noteIntercept2 = model.noteBias.weight.cpu().flatten().detach().numpy().copy() + raterIntercept2 = model.raterBias.weight.cpu().flatten().detach().numpy().copy() - # train round 3 - print("\nRound 3:") + logger.info("\nRound 3: fit intercepts and global intercept with everything else frozen") model.l2Lambda = hParams.l2Lambda * 
hParams.l2LambdaThirdRoundMultiplier model.l2NoteBiasMultiplier = hParams.l2NoteBiasMultiplier * hParams.l2NoteBiasThirdRoundMultiplier model.noteBias.requires_grad_(True) model.noteEmbedding.requires_grad_(False) model.raterEmbedding.requires_grad_(False) model.raterReputation.requires_grad_(False) + raterReputation = model.raterReputation.weight.detach().clone().clip(min=0) loss_fn_3 = WeightedLoss( loss_fn, - notes, - raters, - targets, + dataset.noteTensor, + dataset.raterTensor, + dataset.targetTensor, posWeight=hParams.posWeight * hParams.posWeightThirdRoundMultiplier, raterReputation=raterReputation, reputationExp=hParams.reputationExp, @@ -266,6 +433,102 @@ def train_model( raterNormExp=hParams.raterNormExpThirdRound, device=device, ) + loss3 = _train_one_round(model, loss_fn_3, dataset, hParams) - return model, loss1, loss2, loss3 + logger.info(f"After round 3, global bias: {model.globalBias}") + globalInt3 = model.globalBias.data.cpu().detach().numpy().item() + globalIntercept = c.ReputationGlobalIntercept( + firstRound=globalInt1, secondRound=globalInt2, finalRound=globalInt3 + ) + + return model, loss1, loss2, loss3, globalIntercept, noteIntercept2, raterIntercept2 + + +def train_model_final( + hParams, + dataset, + noteInitState: pd.DataFrame, + raterInitState: pd.DataFrame, + globalInterceptInit: c.ReputationGlobalIntercept, + device=torch.device("cpu"), +): + """ + Args: + hParams (ReputationModelHyperparameters) + dataset (ReputationDataset) + noteInitState (Optional[pd.DataFrame]): expects internal column names e.g. internalNoteIntercept + raterInitState (Optional[pd.DataFrame]): expects internal column names e.g. internalRaterIntercept + """ + hParams.defaultReputation = 0.0 # 0 reputation for raters missing from init. + + # setup_model initializes uses the internal intercepts, but we want to initialize with round 2 intercepts, + # and save the final rater intercepts for later initialization. + noteInitState[c.internalNoteInterceptKey] = noteInitState[c.internalNoteInterceptRound2Key] + + savedFinalRoundPrescoringRaterIntercept = raterInitState[c.internalRaterInterceptKey].copy() + raterInitState[c.internalRaterInterceptKey] = raterInitState[c.internalRaterInterceptRound2Key] + + model, loss_fn = _setup_model( + dataset, hParams, noteInitState, raterInitState, globalInterceptInit.secondRound + ) + + logger.info( + "Final scoring, initial round fitting reputation MF (equivalent to Round 2 in Prescoring - learn note factor)" + ) + + model.noteBias.requires_grad_(False) + model.noteEmbedding.requires_grad_(True) + model.raterEmbedding.requires_grad_(False) + model.raterReputation.requires_grad_(False) + model.raterBias.requires_grad_(False) + model.globalBias.requires_grad_(False) + + loss_fn_2 = WeightedLoss( + loss_fn, + dataset.noteTensor, + dataset.raterTensor, + dataset.targetTensor, + posWeight=hParams.posWeight * hParams.posWeightSecondRoundMultiplier, + noteNormExp=hParams.noteNormExpSecondRound, + raterNormExp=hParams.raterNormExpSecondRound, + device=device, + ) + _train_one_round(model, loss_fn_2, dataset, hParams) + + logger.info("Final scoring, final round fitting reputation MF: learn just note intercept") + + # Now set the global intercept to the value from the final round + model.globalBias.data = torch.tensor(globalInterceptInit.finalRound, **model.format) + + # Set rater intercepts back to final round. We will learn note intercepts, so no need to set them back. 
+ raterInitState[c.internalRaterInterceptKey] = savedFinalRoundPrescoringRaterIntercept + model.init_rater_intercept(raterInitState, dataset, device) + + model.l2Lambda = hParams.l2Lambda * hParams.l2LambdaThirdRoundMultiplier + model.l2NoteBiasMultiplier = hParams.l2NoteBiasMultiplier * hParams.l2NoteBiasThirdRoundMultiplier + + model.noteBias.requires_grad_(True) + model.noteEmbedding.requires_grad_(False) + model.raterEmbedding.requires_grad_(False) + model.raterBias.requires_grad_(False) + model.raterReputation.requires_grad_(False) + model.globalBias.requires_grad_(False) + + raterReputation = model.raterReputation.weight.detach().clone().clip(min=0) + loss_fn_final = WeightedLoss( + loss_fn, + dataset.noteTensor, + dataset.raterTensor, + dataset.targetTensor, + posWeight=hParams.posWeight * hParams.posWeightThirdRoundMultiplier, + raterReputation=raterReputation, + reputationExp=hParams.reputationExp, + alpha=hParams.alpha, + noteNormExp=hParams.noteNormExpThirdRound, + raterNormExp=hParams.raterNormExpThirdRound, + device=device, + ) + + loss_final = _train_one_round(model, loss_fn_final, dataset, hParams) + return model, loss_final diff --git a/sourcecode/scoring/reputation_scorer.py b/sourcecode/scoring/reputation_scorer.py index ea580815..31af5439 100644 --- a/sourcecode/scoring/reputation_scorer.py +++ b/sourcecode/scoring/reputation_scorer.py @@ -1,17 +1,24 @@ -from typing import List, Optional, Tuple +import logging +from typing import Dict, List, Optional, Tuple from . import constants as c from .matrix_factorization.matrix_factorization import MatrixFactorization from .mf_base_scorer import get_ratings_for_stable_init -from .mf_core_scorer import filter_core_input from .process_data import filter_ratings -from .reputation_matrix_factorization.helpfulness_model import get_helpfulness_reputation_results -from .scorer import Scorer +from .reputation_matrix_factorization.helpfulness_model import ( + get_helpfulness_reputation_results_final, + get_helpfulness_reputation_results_prescoring, +) +from .scorer import EmptyRatingException, Scorer import pandas as pd import torch +logger = logging.getLogger("birdwatch.reputation_scorer") +logger.setLevel(logging.INFO) + + class ReputationScorer(Scorer): """Applies reputation matrix factorization to helpfulness bridging.""" @@ -34,7 +41,14 @@ def __init__( in scoring. Notes with fewer ratings are removed. 
threads: number of threads to use for intra-op parallelism in pytorch """ - super().__init__(seed, threads) + super().__init__( + includedTopics=set(), + includedGroups=c.coreGroups, + includeUnassigned=True, + captureThreshold=0.5, + seed=seed, + threads=threads, + ) self._minNumRatingsPerRater = minNumRatingsPerRater self._minNumRatersPerNote = minNumRatersPerNote self._crhThreshold = crhThreshold @@ -63,16 +77,15 @@ def get_internal_scored_notes_cols(self) -> List[str]: def get_helpfulness_scores_cols(self) -> List[str]: """Returns a list of columns which should be present in the helpfulnessScores output.""" - return [ - c.raterParticipantIdKey, - c.raterHelpfulnessReputationKey, - ] + return [c.raterParticipantIdKey, c.raterHelpfulnessReputationKey] def get_internal_helpfulness_scores_cols(self) -> List[str]: """Returns a list of columns which should be present in the helpfulnessScores output.""" return [ c.raterParticipantIdKey, - c.raterHelpfulnessReputationKey, + c.internalRaterInterceptKey, + c.internalRaterFactor1Key, + c.internalRaterReputationKey, ] def get_auxiliary_note_info_cols(self) -> List[str]: @@ -87,33 +100,56 @@ def _get_dropped_user_cols(self) -> List[str]: """Returns a list of columns which should be excluded from helpfulnessScores output.""" return [] - def _filter_input( - self, - noteTopics: pd.DataFrame, - ratings: pd.DataFrame, - noteStatusHistory: pd.DataFrame, - userEnrollment: pd.DataFrame, - ) -> Tuple[pd.DataFrame, pd.DataFrame]: - ratings, noteStatusHistory = filter_core_input(ratings, noteStatusHistory, userEnrollment) - ratings = filter_ratings(ratings, self._minNumRatingsPerRater, self._minNumRatersPerNote) - return ratings, noteStatusHistory + def _get_user_col_mapping(self) -> Dict[str, str]: + """Returns a dict mapping default user column names to custom names for a specific model.""" + return { + c.internalRaterReputationKey: c.raterHelpfulnessReputationKey, + } def _prescore_notes_and_users( self, ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, userEnrollmentRaw: pd.DataFrame - ) -> Tuple[pd.DataFrame, pd.DataFrame]: + ) -> Tuple[pd.DataFrame, pd.DataFrame, c.PrescoringMetaScorerOutput]: if self._seed is not None: - print(f"seeding with {self._seed}") + logger.info(f"seeding with {self._seed}") torch.manual_seed(self._seed) + ratings = filter_ratings(ratings, self._minNumRatingsPerRater, self._minNumRatersPerNote) # Calculate initialization factors if necessary - noteParamsInit = pd.DataFrame() - raterParamsInit = pd.DataFrame() + noteParamsInit = None + raterParamsInit = None if self._modelingGroupToInitializeForStability: ratingsForStableInitialization = get_ratings_for_stable_init( ratings, userEnrollmentRaw, self._modelingGroupToInitializeForStability ) mfRanker = MatrixFactorization() noteParamsInit, raterParamsInit, _ = mfRanker.run_mf(ratingsForStableInitialization) - return noteParamsInit, raterParamsInit + + # We only want to use factors to initialize, not intercepts + noteParamsInit = noteParamsInit[[c.noteIdKey, c.internalNoteFactor1Key]] + raterParamsInit = raterParamsInit[[c.raterParticipantIdKey, c.internalRaterFactor1Key]] + + # Fit multi-phase prescoring for reputation model + noteStats, raterStats, globalIntercept = get_helpfulness_reputation_results_prescoring( + ratings, noteInitState=noteParamsInit, raterInitState=raterParamsInit + ) + # Fill in NaN values for any missing notes + noteStats = noteStats.merge(noteStatusHistory[[c.noteIdKey]].drop_duplicates(), how="outer") + assert len(noteStats) == len(noteStatusHistory) 
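As an illustrative aside (toy data, not from this codebase), the outer-merge-then-assert pattern used just above backfills a NaN row for every note the reputation model did not score, so the result stays aligned 1:1 with noteStatusHistory:

import pandas as pd

# Hypothetical stand-ins for noteStats / noteStatusHistory; real column names come from constants.
noteStats = pd.DataFrame({"noteId": [1, 2], "internalNoteIntercept": [0.4, -0.1]})
noteStatusHistory = pd.DataFrame({"noteId": [1, 2, 3]})

# The outer merge adds a NaN row for note 3, so lengths match and the assert above holds.
noteStats = noteStats.merge(noteStatusHistory[["noteId"]].drop_duplicates(), how="outer")
assert len(noteStats) == len(noteStatusHistory)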
+ logger.info( + f"""Reputation prescoring: returning these columns: + noteStats: {noteStats.columns} + raterStats: {raterStats.columns} + """ + ) + + metaScorerOutput = c.PrescoringMetaScorerOutput( + globalIntercept=None, + lowDiligenceGlobalIntercept=globalIntercept, + tagFilteringThresholds=None, + finalRoundNumRatings=None, + finalRoundNumNotes=None, + finalRoundNumUsers=None, + ) + return noteStats, raterStats, metaScorerOutput def _score_notes_and_users( self, @@ -121,15 +157,26 @@ def _score_notes_and_users( noteStatusHistory: pd.DataFrame, prescoringNoteModelOutput: pd.DataFrame, prescoringRaterModelOutput: pd.DataFrame, - usePreviouslySavedStateIfExists: bool = True, + prescoringMetaScorerOutput: c.PrescoringMetaScorerOutput, ) -> Tuple[pd.DataFrame, pd.DataFrame]: if self._seed is not None: - print(f"seeding with {self._seed}") + logger.info(f"seeding with {self._seed}") torch.manual_seed(self._seed) + ratings = filter_ratings(ratings, self._minNumRatingsPerRater, self._minNumRatersPerNote) + if len(ratings) == 0: + raise EmptyRatingException() # Apply model - noteStats, raterStats = get_helpfulness_reputation_results( - ratings, noteInitState=prescoringNoteModelOutput, raterInitState=prescoringRaterModelOutput + # Note: we use the low diligence global intercept here as a temporary hack, since the prod scorer's + # globalIntercept field is a float and we need to store a c.ReputationGlobalIntercept. + assert ( + prescoringMetaScorerOutput.lowDiligenceGlobalIntercept is not None + ), "Missing prescoring global intercept" + noteStats, raterStats = get_helpfulness_reputation_results_final( + ratings, + noteInitState=prescoringNoteModelOutput, + raterInitState=prescoringRaterModelOutput, + globalIntercept=prescoringMetaScorerOutput.lowDiligenceGlobalIntercept, ) # Assign rating status noteStats[c.coverageRatingStatusKey] = c.needsMoreRatings diff --git a/sourcecode/scoring/run_scoring.py b/sourcecode/scoring/run_scoring.py index b6587eed..f77e490a 100644 --- a/sourcecode/scoring/run_scoring.py +++ b/sourcecode/scoring/run_scoring.py @@ -6,7 +6,10 @@ """ import concurrent.futures import copy +import gc +import io from itertools import chain +import logging import multiprocessing from multiprocessing import shared_memory # type: ignore import time @@ -21,12 +24,24 @@ from .mf_expansion_scorer import MFExpansionScorer from .mf_group_scorer import ( MFGroupScorer, - coalesce_group_models, + coalesce_group_model_helpfulness_scores, + coalesce_group_model_scored_notes, groupScorerCount, + groupScorerParalleism, trialScoringGroup, ) +from .mf_multi_group_scorer import ( + MFMultiGroupScorer, + coalesce_multi_group_model_helpfulness_scores, + coalesce_multi_group_model_scored_notes, +) from .mf_topic_scorer import MFTopicScorer, coalesce_topic_models -from .process_data import CommunityNotesDataLoader +from .pandas_utils import get_df_info, keep_columns +from .post_selection_similarity import ( + PostSelectionSimilarity, + filter_ratings_by_post_selection_similarity, +) +from .process_data import CommunityNotesDataLoader, filter_input_data_for_testing, preprocess_data from .reputation_scorer import ReputationScorer from .scorer import Scorer from .scoring_rules import RuleID @@ -34,12 +49,16 @@ import numpy as np import pandas as pd +import sklearn + + +logger = logging.getLogger("birdwatch.run_scoring") +logger.setLevel(logging.INFO) def _get_scorers( seed: Optional[int], pseudoraters: Optional[bool], - enabledScorers: Optional[Set[Scorers]], useStableInitialization: bool = True, ) -> 
Dict[Scorers, List[Scorer]]: """Instantiate all Scorer objects which should be used for note ranking. @@ -47,84 +66,80 @@ def _get_scorers( Args: seed (int, optional): if not None, base distinct seeds for the first and second MF rounds on this value pseudoraters (bool, optional): if True, compute optional pseudorater confidence intervals - enabledScorers: if not None, set of which scorers should be instantiated and enabled Returns: Dict[Scorers, List[Scorer]] containing instantiated Scorer objects for note ranking. """ scorers: Dict[Scorers, List[Scorer]] = dict() - - if enabledScorers is None or Scorers.MFCoreScorer in enabledScorers: - scorers[Scorers.MFCoreScorer] = [ - MFCoreScorer(seed, pseudoraters, useStableInitialization=useStableInitialization, threads=12) - ] - if enabledScorers is None or Scorers.MFExpansionScorer in enabledScorers: - scorers[Scorers.MFExpansionScorer] = [ - MFExpansionScorer(seed, useStableInitialization=useStableInitialization, threads=12) - ] - if enabledScorers is None or Scorers.MFExpansionPlusScorer in enabledScorers: - scorers[Scorers.MFExpansionPlusScorer] = [ - MFExpansionPlusScorer(seed, useStableInitialization=useStableInitialization, threads=12) - ] - if enabledScorers is None or Scorers.ReputationScorer in enabledScorers: - scorers[Scorers.ReputationScorer] = [ - ReputationScorer(seed, useStableInitialization=useStableInitialization, threads=12) - ] - if enabledScorers is None or Scorers.MFGroupScorer in enabledScorers: - # Note that index 0 is reserved, corresponding to no group assigned, so scoring group - # numbers begin with index 1. - scorers[Scorers.MFGroupScorer] = [ - # Scoring Group 13 is currently the largest by far, so total runtime benefits from - # adding the group scorers in descending order so we start work on Group 13 first. - MFGroupScorer(groupNumber=i, seed=seed) - for i in range(groupScorerCount, 0, -1) - if i != trialScoringGroup - ] - scorers[Scorers.MFGroupScorer].append( - MFGroupScorer( - groupNumber=trialScoringGroup, - seed=seed, - noteInterceptLambda=0.03 * 30, - userInterceptLambda=0.03 * 5, - globalInterceptLambda=0.03 * 5, - noteFactorLambda=0.03 / 3, - userFactorLambda=0.03 / 4, - diamondLambda=0.03 * 25, - normalizedLossHyperparameters=NormalizedLossHyperparameters( - globalSignNorm=True, noteSignAlpha=None, noteNormExp=0, raterNormExp=-0.25 - ), - maxFinalMFTrainError=0.16, - requireInternalAuthor=False, - groupThreshold=0.4, - minMeanNoteScore=-0.01, - crhThreshold=0.09, - crhSuperThreshold=0.2, - crnhThresholdIntercept=-0.01, - crnhThresholdNoteFactorMultiplier=0, - crnhThresholdNMIntercept=-0.02, - lowDiligenceThreshold=1000, - factorThreshold=0.4, - multiplyPenaltyByHarassmentScore=False, - minimumHarassmentScoreToPenalize=2.5, - tagConsensusHarassmentHelpfulRatingPenalty=10, - ) + scorers[Scorers.MFCoreScorer] = [ + MFCoreScorer(seed, pseudoraters, useStableInitialization=useStableInitialization, threads=12) + ] + scorers[Scorers.MFExpansionScorer] = [ + MFExpansionScorer(seed, useStableInitialization=useStableInitialization, threads=12) + ] + scorers[Scorers.MFExpansionPlusScorer] = [ + MFExpansionPlusScorer(seed, useStableInitialization=useStableInitialization, threads=12) + ] + scorers[Scorers.ReputationScorer] = [ + ReputationScorer(seed, useStableInitialization=useStableInitialization, threads=12) + ] + # Note that index 0 is reserved, corresponding to no group assigned, so scoring group + # numbers begin with index 1. 
+ scorers[Scorers.MFGroupScorer] = [ + # Scoring Group 13 is currently the largest by far, so total runtime benefits from + # adding the group scorers in descending order so we start work on Group 13 first. + MFGroupScorer(includedGroups={i}, groupId=i, threads=groupScorerParalleism.get(i, 4), seed=seed) + for i in range(groupScorerCount, 0, -1) + if i != trialScoringGroup + ] + scorers[Scorers.MFGroupScorer].append( + MFGroupScorer( + includedGroups={trialScoringGroup}, + groupId=trialScoringGroup, + threads=groupScorerParalleism.get(trialScoringGroup, 4), + seed=seed, + noteInterceptLambda=0.03 * 30, + userInterceptLambda=0.03 * 5, + globalInterceptLambda=0.03 * 5, + noteFactorLambda=0.03 / 3, + userFactorLambda=0.03 / 4, + diamondLambda=0.03 * 25, + normalizedLossHyperparameters=NormalizedLossHyperparameters( + globalSignNorm=True, noteSignAlpha=None, noteNormExp=0, raterNormExp=-0.25 + ), + maxFinalMFTrainError=0.16, + groupThreshold=0.4, + minMeanNoteScore=-0.01, + crhThreshold=0.15, + crhSuperThreshold=None, + crnhThresholdIntercept=-0.01, + crnhThresholdNoteFactorMultiplier=0, + crnhThresholdNMIntercept=-0.02, + lowDiligenceThreshold=1000, + factorThreshold=0.4, + multiplyPenaltyByHarassmentScore=False, + minimumHarassmentScoreToPenalize=2.5, + tagConsensusHarassmentHelpfulRatingPenalty=10, + tagFilterPercentile=90, + incorrectFilterThreshold=1.5, ) - if enabledScorers is None or Scorers.MFTopicScorer in enabledScorers: - scorers[Scorers.MFTopicScorer] = [ - MFTopicScorer(topicName=topic.name, seed=seed) for topic in Topics - ] + ) + scorers[Scorers.MFTopicScorer] = [ + MFTopicScorer(topicName=topic.name, seed=seed) for topic in Topics + ] + scorers[Scorers.MFMultiGroupScorer] = [ + MFMultiGroupScorer(includedGroups={4, 5, 7, 12, 26}, groupId=1, threads=4, seed=seed), + ] return scorers def _merge_results( scoredNotes: pd.DataFrame, - helpfulnessScores: pd.DataFrame, auxiliaryNoteInfo: pd.DataFrame, modelScoredNotes: pd.DataFrame, - modelHelpfulnessScores: Optional[pd.DataFrame], modelauxiliaryNoteInfo: Optional[pd.DataFrame], -) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: +) -> Tuple[pd.DataFrame, pd.DataFrame]: """Merges results from a specific model with results from prior models. 
The DFs returned by each model will be (outer) merged and passed through directly to the @@ -133,10 +148,8 @@ def _merge_results( Args: scoredNotes: pd.DataFrame containing key scoring results - helpfulnessScores: pd.DataFrame containing contributor specific scoring results auxiliaryNoteInfo: pd.DataFrame containing intermediate scoring state modelScoredNotes: pd.DataFrame containing scoredNotes result for a particular model - modelHelpfulnessScores: None or pd.DataFrame containing helpfulnessScores result for a particular model modelauxiliaryNoteInfo: None or pd.DataFrame containing auxiliaryNoteInfo result for a particular model Returns: @@ -147,20 +160,26 @@ def _merge_results( c.noteIdKey }, "column names must be globally unique" scoredNotesSize = len(scoredNotes) - scoredNotes = scoredNotes.merge(modelScoredNotes, on=c.noteIdKey, how="outer") + unsafeAllowed = set( + [ + c.noteIdKey, + c.defaultIndexKey, + ] + + [f"{c.modelingGroupKey}_{group}" for group in range(groupScorerCount, 0, -1)] + + [f"{c.topicNoteConfidentKey}_{topic.name}" for topic in Topics] + + [f"{c.groupNumFinalRoundRatingsKey}_{group}" for group in range(groupScorerCount, 0, -1)] + + [f"{c.topicNumFinalRoundRatingsKey}_{topic.name}" for topic in Topics] + ) + scoredNotes = scoredNotes.merge( + modelScoredNotes, + on=c.noteIdKey, + how="outer", + unsafeAllowed=unsafeAllowed, + ) assert len(scoredNotes) == scoredNotesSize, "scoredNotes should not expand" - # Merge helpfulnessScores - if modelHelpfulnessScores is not None: - assert (set(modelHelpfulnessScores.columns) & set(helpfulnessScores.columns)) == { - c.raterParticipantIdKey - }, "column names must be globally unique" - helpfulnessScores = helpfulnessScores.merge( - modelHelpfulnessScores, on=c.raterParticipantIdKey, how="outer" - ) - # Merge auxiliaryNoteInfo - if modelauxiliaryNoteInfo is not None: + if modelauxiliaryNoteInfo is not None and len(modelauxiliaryNoteInfo.columns) > 0: assert (set(modelauxiliaryNoteInfo.columns) & set(auxiliaryNoteInfo.columns)) == { c.noteIdKey }, "column names must be globally unique" @@ -168,7 +187,7 @@ def _merge_results( auxiliaryNoteInfo = auxiliaryNoteInfo.merge(modelauxiliaryNoteInfo, on=c.noteIdKey, how="outer") assert len(auxiliaryNoteInfo) == auxiliaryNoteInfoSize, "auxiliaryNoteInfo should not expand" - return scoredNotes, helpfulnessScores, auxiliaryNoteInfo + return scoredNotes, auxiliaryNoteInfo def _load_data_with_data_loader_parallelizable( @@ -248,17 +267,17 @@ def _run_scorer_parallelizable( scoringArgs = copy.deepcopy(scoringArgs) if scoringArgsSharedMemory is not None: - print( + logger.info( f"{scorer.get_name()} run_scorer_parallelizable just started in parallel: loading data from shared memory." ) scoringArgs = _load_data_from_shared_memory_parallelizable( scoringArgsSharedMemory, scoringArgs ) - print( + logger.info( f"{scorer.get_name()} run_scorer_parallelizable just finished loading data from shared memory." ) elif dataLoader is not None: - print( + logger.info( f"{scorer.get_name()} run_scorer_parallelizable just started in parallel: loading data with dataLoader." 
) scoringArgs = _load_data_with_data_loader_parallelizable(dataLoader, scoringArgs) @@ -270,7 +289,7 @@ def _run_scorer_parallelizable( # Run scoring scorerStartTime = time.perf_counter() if type(scoringArgs) == PrescoringArgs: - scoringResults = scorer.prescore(scoringArgs) + scoringResults = scorer.prescore(scoringArgs, preserveRatings=not runParallel) elif type(scoringArgs) == FinalScoringArgs: scoringResults = scorer.score_final(scoringArgs) else: @@ -286,19 +305,15 @@ def save_df_to_shared_memory(df: pd.DataFrame, shms: List) -> c.SharedMemoryData and returns the info needed to access it, as well as appends it to the list of shared memory objects so it's not garbage collected and can be closed later. """ - cols = df.columns - data = df.to_numpy() - df_dtypes_dict = dict(list(zip(df.columns, df.dtypes))) - shm = shared_memory.SharedMemory(create=True, size=data.nbytes) - np_array = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf) - np_array[:] = data[:] + with io.BytesIO() as buf: + df.to_parquet(buf, compression="gzip", engine="pyarrow") + size = len(buf.getvalue()) + shm = shared_memory.SharedMemory(create=True, size=size) + shm.buf[:size] = buf.getvalue() shms.append(shm) # save the shared memory object so we can close it later return c.SharedMemoryDataframeInfo( sharedMemoryName=shm.name, - columns=cols, - dataShape=data.shape, - dtypesDict=df_dtypes_dict, - npDtype=np_array.dtype, + dataSize=size, ) @@ -308,12 +323,9 @@ def get_df_from_shared_memory(sharedMemoryDfInfo: c.SharedMemoryDataframeInfo) - Read a dataframe from shared memory and return it. """ existing_shm = shared_memory.SharedMemory(name=sharedMemoryDfInfo.sharedMemoryName) - np_array = np.ndarray( - sharedMemoryDfInfo.dataShape, buffer=existing_shm.buf, dtype=sharedMemoryDfInfo.npDtype - ) - df = pd.DataFrame(np_array, columns=sharedMemoryDfInfo.columns) - df = df.astype(sharedMemoryDfInfo.dtypesDict) - return df + size = sharedMemoryDfInfo.dataSize + with io.BytesIO(existing_shm.buf[:size]) as buf: + return pd.read_parquet(buf) def _save_dfs_to_shared_memory( @@ -324,7 +336,21 @@ def _save_dfs_to_shared_memory( """ shms: List[shared_memory.SharedMemory] = [] noteTopics = save_df_to_shared_memory(scoringArgs.noteTopics, shms) - ratings = save_df_to_shared_memory(scoringArgs.ratings, shms) + ratings = save_df_to_shared_memory( + keep_columns( + scoringArgs.ratings, + [ + c.noteIdKey, + c.raterParticipantIdKey, + c.helpfulNumKey, + c.helpfulnessLevelKey, + c.createdAtMillisKey, + ] + + c.notHelpfulTagsTSVOrder + + c.helpfulTagsTSVOrder, + ), + shms, + ) noteStatusHistory = save_df_to_shared_memory(scoringArgs.noteStatusHistory, shms) userEnrollment = save_df_to_shared_memory(scoringArgs.userEnrollment, shms) @@ -375,6 +401,7 @@ def _run_scorers( """ # Apply scoring algorithms overallStartTime = time.perf_counter() + if runParallel: shms, scoringArgsSharedMemory = _save_dfs_to_shared_memory(scoringArgs) @@ -382,7 +409,7 @@ def _run_scorers( mp_context=multiprocessing.get_context("fork"), max_workers=maxWorkers, ) as executor: - print(f"Starting parallel scorer execution with {len(scorers)} scorers.") + logger.info(f"Starting parallel scorer execution with {len(scorers)} scorers.") # Pass mostly-empty scoringArgs: the data is too large to be copied in-memory to # each process, so must be re-loaded from disk by every scorer's dataLoader. 
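For reference, the parquet-over-shared-memory round trip that save_df_to_shared_memory and get_df_from_shared_memory implement above can be sketched standalone roughly as follows (toy DataFrame; assumes pyarrow is installed; the real code also carries the byte size in c.SharedMemoryDataframeInfo rather than a local variable):

import io
from multiprocessing import shared_memory

import pandas as pd

df = pd.DataFrame({"noteId": [1, 2, 3], "helpfulNum": [1.0, 0.0, 0.5]})

# Writer: serialize to an in-memory parquet buffer and copy the bytes into shared memory.
with io.BytesIO() as buf:
  df.to_parquet(buf, compression="gzip", engine="pyarrow")
  data = buf.getvalue()
shm = shared_memory.SharedMemory(create=True, size=len(data))
shm.buf[: len(data)] = data

# Reader: attach by name and rebuild the DataFrame from the serialized bytes.
attached = shared_memory.SharedMemory(name=shm.name)
with io.BytesIO(attached.buf[: len(data)]) as buf:
  roundTripped = pd.read_parquet(buf)
assert roundTripped.equals(df)

attached.close()
shm.close()
shm.unlink()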
scoringArgs.remove_large_args_for_multiprocessing() @@ -398,6 +425,7 @@ for scorer in scorers ] modelResultsAndTimes = [f.result() for f in futures] + logger.info("Got model results from all scorers.") for shm in shms: shm.close() @@ -415,7 +443,7 @@ modelResultsTuple, scorerTimesTuple = zip(*modelResultsAndTimes) overallTime = time.perf_counter() - overallStartTime - print( + logger.info( f"""---- Completed individual scorers. Ran in parallel: {runParallel}. Succeeded in {overallTime:.2f} seconds. Individual scorers: (name, runtime): {list(zip( @@ -427,7 +455,9 @@ return list(modelResultsTuple) -def combine_prescorer_scorer_results(modelResults: List[ModelResult]): +def combine_prescorer_scorer_results( + modelResults: List[ModelResult], +) -> Tuple[pd.DataFrame, pd.DataFrame, c.PrescoringMetaOutput]: """ Returns dfs with original columns plus an extra scorer name column. """ @@ -435,56 +465,160 @@ prescoringNoteModelOutputList = [] raterParamsUnfilteredMultiScorersList = [] + prescoringMetaOutput = c.PrescoringMetaOutput(metaScorerOutput={}) + for modelResult in modelResults: if modelResult.scoredNotes is not None: modelResult.scoredNotes[c.scorerNameKey] = modelResult.scorerName prescoringNoteModelOutputList.append(modelResult.scoredNotes) + if modelResult.helpfulnessScores is not None: modelResult.helpfulnessScores[c.scorerNameKey] = modelResult.scorerName raterParamsUnfilteredMultiScorersList.append(modelResult.helpfulnessScores) - prescoringNoteModelOutput = pd.concat(prescoringNoteModelOutputList) - raterParamsUnfilteredMultiScorers = pd.concat(raterParamsUnfilteredMultiScorersList) - return prescoringNoteModelOutput, raterParamsUnfilteredMultiScorers + if modelResult.metaScores is not None and modelResult.scorerName is not None: + prescoringMetaOutput.metaScorerOutput[modelResult.scorerName] = modelResult.metaScores + + prescoringNoteModelOutput = pd.concat( + prescoringNoteModelOutputList, + unsafeAllowed={ + c.defaultIndexKey, + c.noteIdKey, + c.internalNoteInterceptKey, + c.internalNoteFactor1Key, + c.lowDiligenceNoteInterceptKey, + c.lowDiligenceNoteFactor1Key, + }, + ) + # BUG: The type error for this concat operation shows a mix of Int64 and float64 values in + # some columns, suggesting that an input may be empty. The type error is preceded by this + # warning from Pandas, which also points to an empty input: + # FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is + # deprecated. In a future version, this will no longer exclude empty or all-NA columns + # when determining the result dtypes. To retain the old behavior, exclude the relevant + # entries before the concat operation. + # All columns below except incorrectTagRatingsMadeByRater and raterParticipantId show a mix + # of float64/float32 and object. incorrectTagRatingsMadeByRater mixes Int64/Int8 and object, + # and raterParticipantId mixes Int64 and object.
+ raterParamsUnfilteredMultiScorers = pd.concat( + raterParamsUnfilteredMultiScorersList, + unsafeAllowed={ + c.defaultIndexKey, + c.internalRaterInterceptKey, + c.internalRaterFactor1Key, + c.crhCrnhRatioDifferenceKey, + c.meanNoteScoreKey, + c.raterAgreeRatioKey, + c.aboveHelpfulnessThresholdKey, + c.internalRaterReputationKey, + c.lowDiligenceRaterInterceptKey, + c.lowDiligenceRaterFactor1Key, + c.lowDiligenceRaterReputationKey, + c.incorrectTagRatingsMadeByRaterKey, + c.raterParticipantIdKey, + }, + ) + return ( + prescoringNoteModelOutput[c.prescoringNoteModelOutputTSVColumns], + raterParamsUnfilteredMultiScorers[c.prescoringRaterModelOutputTSVColumns], + prescoringMetaOutput, + ) def combine_final_scorer_results( modelResultsFromEachScorer: List[ModelResult], noteStatusHistory: pd.DataFrame, -): +) -> Tuple[pd.DataFrame, pd.DataFrame]: """ Returns: - Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: + Tuple[pd.DataFrame, pd.DataFrame]: scoredNotes pd.DataFrame: one row per note contained note scores and parameters. - helpfulnessScores pd.DataFrame: one row per user containing a column for each helpfulness score. auxiliaryNoteInfo pd.DataFrame: one row per note containing supplemental values used in scoring. """ # Initialize return data frames. scoredNotes = noteStatusHistory[[c.noteIdKey]].drop_duplicates() auxiliaryNoteInfo = noteStatusHistory[[c.noteIdKey]].drop_duplicates() - helpfulnessScores = pd.DataFrame({c.raterParticipantIdKey: []}) # Merge the results for modelResult in modelResultsFromEachScorer: - scoredNotes, helpfulnessScores, auxiliaryNoteInfo = _merge_results( + scoredNotes, auxiliaryNoteInfo = _merge_results( scoredNotes, - helpfulnessScores, auxiliaryNoteInfo, modelResult.scoredNotes, - modelResult.helpfulnessScores, modelResult.auxiliaryNoteInfo, ) - scoredNotes, helpfulnessScores = coalesce_group_models(scoredNotes, helpfulnessScores) + scoredNotes = coalesce_group_model_scored_notes(scoredNotes) + scoredNotes = coalesce_multi_group_model_scored_notes(scoredNotes) scoredNotes = coalesce_topic_models(scoredNotes) - return scoredNotes, helpfulnessScores, auxiliaryNoteInfo + return scoredNotes, auxiliaryNoteInfo + + +def convert_prescoring_rater_model_output_to_coalesced_helpfulness_scores( + prescoringRaterModelOutput: pd.DataFrame, + userEnrollment: pd.DataFrame, +): + # Join modeling groups from enrollment + prescoringRaterModelOutput = prescoringRaterModelOutput.merge( + userEnrollment[[c.participantIdKey, c.modelingGroupKey]], + left_on=c.raterParticipantIdKey, + right_on=c.participantIdKey, + how="left", + ) + + helpfulnessScores = prescoringRaterModelOutput[ + [ + c.raterParticipantIdKey, + ] + ].drop_duplicates() + + scorersEnumDict = _get_scorers(seed=None, pseudoraters=None) + scorers = chain(*scorersEnumDict.values()) + uniqueScorerNames = prescoringRaterModelOutput[c.scorerNameKey].unique() + for scorer in scorers: + scorerName = scorer.get_name() + if scorerName not in uniqueScorerNames: + continue + + scorerOutputInternalNames = prescoringRaterModelOutput[ + (prescoringRaterModelOutput[c.scorerNameKey] == scorerName) + ] + scorerOutputExternalNames = scorerOutputInternalNames.rename( + columns=scorer._get_user_col_mapping() + ) + if isinstance(scorer, MFGroupScorer): + scorerOutputExternalNames[scorer._modelingGroupKey] = scorer._groupId + # Raters may appear in multiple groups due to authorship -- filter out rows not from this group + scorerOutputExternalNames = scorerOutputExternalNames[ + 
scorerOutputExternalNames[c.modelingGroupKey].isin(scorer._includedGroups) + ] + + finalCols = scorer.get_helpfulness_scores_cols() + if c.raterParticipantIdKey not in finalCols: + finalCols.append(c.raterParticipantIdKey) + scorerOutputExternalNames = scorerOutputExternalNames[finalCols] + + if isinstance(scorer, MFGroupScorer): + helpfulnessScores = helpfulnessScores.merge( + scorerOutputExternalNames, + on=c.raterParticipantIdKey, + how="outer", + unsafeAllowed=scorer._modelingGroupKey, + ) + else: + helpfulnessScores = helpfulnessScores.merge( + scorerOutputExternalNames, on=c.raterParticipantIdKey, how="outer" + ) + + return helpfulnessScores def meta_score( scorers: Dict[Scorers, List[Scorer]], scoredNotes: pd.DataFrame, auxiliaryNoteInfo: pd.DataFrame, - lockedStatus: pd.DataFrame, + noteStatusHistory: pd.DataFrame, enabledScorers: Optional[Set[Scorers]], + enableNmrDueToMinStableCrhTime: bool = True, ) -> Tuple[pd.DataFrame, pd.DataFrame]: """Determine final note status based on individual scoring results. @@ -496,7 +630,7 @@ def meta_score( Args: scoredNotes: pd.DataFrame containing all scored note results. auxiliaryNoteInfo: pd.DataFrame containing tag aggregates - lockedStatus: pd.DataFrame containing {noteId, status} pairs for all notes + noteStatusHistory: pd.DataFrame containing {noteId, lockedStatus, timestampMillisOfNmrDueToMinStableCrhTime} for all notes enabledScorers: if not None, set of which scorers should be instantiated and enabled Returns: @@ -508,7 +642,13 @@ def meta_score( with c.time_block("Post-scorers: Meta Score: Setup"): assert len(scoredNotes) == len(auxiliaryNoteInfo) scoredNotes = scoredNotes.merge( - auxiliaryNoteInfo[[c.noteIdKey] + c.helpfulTagsTSVOrder + c.notHelpfulTagsTSVOrder], + auxiliaryNoteInfo[ + [c.noteIdKey, c.currentLabelKey] + c.helpfulTagsTSVOrder + c.notHelpfulTagsTSVOrder + ], + on=c.noteIdKey, + ) + scoredNotes = scoredNotes.merge( + noteStatusHistory[[c.noteIdKey, c.timestampMillisOfNmrDueToMinStableCrhTimeKey]], on=c.noteIdKey, ) assert len(scoredNotes) == len(auxiliaryNoteInfo) @@ -545,6 +685,16 @@ def meta_score( c.coreRatingStatusKey, ) ) + if enabledScorers is None or Scorers.MFMultiGroupScorer in enabledScorers: + rules.append( + scoring_rules.ApplyModelResult( + RuleID["MULTI_GROUP_MODEL_1"], + {RuleID.CORE_MODEL}, + c.multiGroupRatingStatusKey, + checkFirmReject=True, + filterColumnPairs=[(c.modelingMultiGroupKey, 1)], + ) + ) if enabledScorers is None or Scorers.MFGroupScorer in enabledScorers: # TODO: modify this code to work when MFExpansionScorer is disabled by the system test assert len(scorers[Scorers.MFCoreScorer]) == 1 @@ -574,7 +724,7 @@ def meta_score( i, None, None, - minSafeguardThreshold=None, + minSafeguardThreshold=0.25, ) ) if enabledScorers is None or Scorers.MFTopicScorer in enabledScorers: @@ -588,10 +738,17 @@ def meta_score( topic, ) ) + if enableNmrDueToMinStableCrhTime: + rules.append( + scoring_rules.NmrDueToMinStableCrhTime( + RuleID.NMR_DUE_TO_MIN_STABLE_CRH_TIME, + {RuleID.CORE_MODEL}, + ) + ) rules.extend( [ scoring_rules.ScoringDriftGuard( - RuleID.SCORING_DRIFT_GUARD, {RuleID.CORE_MODEL}, lockedStatus + RuleID.SCORING_DRIFT_GUARD, {RuleID.CORE_MODEL}, noteStatusHistory ), # TODO: The rule below both sets tags for notes which are CRH / CRNH and unsets status for # any notes which are CRH / CRNH but don't have enough ratings to assign two tags. 
The later @@ -619,7 +776,23 @@ def meta_score( c.metaScorerActiveRulesKey, decidedByColumn=c.decidedByKey, ) - + if not enableNmrDueToMinStableCrhTime: + scoringResult[c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey] = np.nan + # Validate that nothing that was a FIRM_REJECT or CRNH from Core or Expansion is rated CRH + coreRejects = scoringResult[c.coreRatingStatusKey].isin( + {c.firmReject, c.currentlyRatedNotHelpful} + ) + expansionRejects = scoringResult[c.expansionRatingStatusKey].isin( + {c.firmReject, c.currentlyRatedNotHelpful} + ) + blockedRows = coreRejects | (scoringResult[c.coreRatingStatusKey].isna() & expansionRejects) + crhRows = scoringResult[c.finalRatingStatusKey] == c.currentlyRatedHelpful + logger.info("Summary of blocked and CRH rows:") + # TODO: validate that these are all due to ScoringDriftGuard and change to an assert + logger.info( + scoringResult[blockedRows & crhRows][c.metaScorerActiveRulesKey].value_counts(dropna=False) + ) + logger.info(scoringResult[blockedRows & crhRows][c.decidedByKey].value_counts(dropna=False)) with c.time_block("Post-scorers: Meta Score: Preparing Return Values"): scoredNotesCols = scoringResult[ [ @@ -629,6 +802,7 @@ def meta_score( c.firstTagKey, c.secondTagKey, c.decidedByKey, + c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey, ] ] auxiliaryNoteInfoCols = scoringResult[ @@ -713,6 +887,7 @@ def _compute_helpfulness_scores( with c.time_block("Meta Helpfulness Scorers: Setup"): # Generate a unified view of note scoring information for computing contributor stats assert len(scoredNotes) == len(auxiliaryNoteInfo), "notes in both note inputs must match" + scoredNotesWithStats = scoredNotes.merge( # noteId and timestamp are the only common fields, and should always be equal. auxiliaryNoteInfo, @@ -768,6 +943,7 @@ def _compute_helpfulness_scores( ], on=c.raterParticipantIdKey, how="outer", + unsafeAllowed={c.enrollmentState, c.isEmergingWriterKey}, ) contributorScores = contributor_state.single_trigger_earn_out(contributorScores) contributorScores = contributor_state.calculate_ri_to_earn_in(contributorScores) @@ -784,6 +960,7 @@ def _compute_helpfulness_scores( left_on=c.raterParticipantIdKey, right_on=c.participantIdKey, how="left", + unsafeAllowed=(c.enrollmentState + "_prev"), ).drop(c.participantIdKey, axis=1) # For users who did not earn a new enrollmentState, carry over the previous one @@ -813,22 +990,22 @@ def _add_deprecated_columns(scoredNotes: pd.DataFrame) -> pd.DataFrame: scoredNotes[column] = np.nan elif columnType == str: scoredNotes[column] = "" + elif columnType == "category": + scoredNotes[column] = np.nan else: assert False, f"column type {columnType} unsupported" return scoredNotes -def _validate( +def _validate_note_scoring_output( scoredNotes: pd.DataFrame, - helpfulnessScores: pd.DataFrame, noteStatusHistory: pd.DataFrame, auxiliaryNoteInfo: pd.DataFrame, -) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]: +) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: """Guarantee that each dataframe has the expected columns in the correct order. 
Args: scoredNotes (pd.DataFrame): notes with scores returned by MF scoring algorithm - helpfulnessScores (pd.DataFrame): BasicReputation scores for all raters noteStatusHistory (pd.DataFrame): one row per note; history of when note had each status auxiliaryNoteInfo (pd.DataFrame): additional fields generated during note scoring @@ -839,10 +1016,6 @@ def _validate( c.noteModelOutputTSVColumns ), f"Got {sorted(scoredNotes.columns)}, expected {sorted(c.noteModelOutputTSVColumns)}" scoredNotes = scoredNotes[c.noteModelOutputTSVColumns] - assert set(helpfulnessScores.columns) == set( - c.raterModelOutputTSVColumns - ), f"Got {sorted(helpfulnessScores.columns)}, expected {sorted(c.raterModelOutputTSVColumns)}" - helpfulnessScores = helpfulnessScores[c.raterModelOutputTSVColumns] assert set(noteStatusHistory.columns) == set( c.noteStatusHistoryTSVColumns ), f"Got {sorted(noteStatusHistory.columns)}, expected {sorted(c.noteStatusHistoryTSVColumns)}" @@ -851,7 +1024,15 @@ def _validate( c.auxiliaryScoredNotesTSVColumns ), f"Got {sorted(auxiliaryNoteInfo.columns)}, expected {sorted(c.auxiliaryScoredNotesTSVColumns)}" auxiliaryNoteInfo = auxiliaryNoteInfo[c.auxiliaryScoredNotesTSVColumns] - return (scoredNotes, helpfulnessScores, noteStatusHistory, auxiliaryNoteInfo) + return (scoredNotes, noteStatusHistory, auxiliaryNoteInfo) + + +def _validate_contributor_scoring_output(helpfulnessScores: pd.DataFrame) -> pd.DataFrame: + assert set(helpfulnessScores.columns) == set( + c.raterModelOutputTSVColumns + ), f"Got {sorted(helpfulnessScores.columns)}, expected {sorted(c.raterModelOutputTSVColumns)}" + helpfulnessScores = helpfulnessScores[c.raterModelOutputTSVColumns] + return helpfulnessScores def run_prescoring( @@ -864,18 +1045,68 @@ def run_prescoring( runParallel: bool = True, dataLoader: Optional[CommunityNotesDataLoader] = None, useStableInitialization: bool = True, -) -> Tuple[pd.DataFrame, pd.DataFrame]: + pseudoraters: bool = True, + checkFlips: bool = True, + enableNmrDueToMinStableCrhTime: bool = True, + previousRatingCutoffTimestampMillis: Optional[int] = None, +) -> Tuple[ + pd.DataFrame, pd.DataFrame, sklearn.pipeline.Pipeline, c.PrescoringMetaOutput, pd.DataFrame +]: + with c.time_block("Logging Prescoring Inputs Initial RAM usage"): + logger.info(get_df_info(notes, "notes")) + logger.info(get_df_info(ratings, "ratings")) + logger.info(get_df_info(noteStatusHistory, "noteStatusHistory")) + logger.info(get_df_info(userEnrollment, "userEnrollment")) with c.time_block("Note Topic Assignment"): topicModel = TopicModel() - noteTopics = topicModel.get_note_topics(notes) + noteTopicClassifierPipe, seedLabels, conflictedTexts = topicModel.train_note_topic_classifier( + notes + ) + noteTopics = topicModel.get_note_topics( + notes, noteTopicClassifierPipe, seedLabels, conflictedTextsForAccuracyEval=conflictedTexts + ) + + with c.time_block("Compute Post Selection Similarity"): + pss = PostSelectionSimilarity(notes, ratings) + postSelectionSimilarityValues = pss.get_post_selection_similarity_values() + logger.info(f"Post Selection Similarity Prescoring: begin with {len(ratings)} ratings.") + ratings = filter_ratings_by_post_selection_similarity( + notes, ratings, postSelectionSimilarityValues + ) + logger.info(f"Post Selection Similarity Prescoring: {len(ratings)} ratings remaining.") + del pss + gc.collect() scorers = _get_scorers( seed=seed, pseudoraters=False, - enabledScorers=enabledScorers, useStableInitialization=useStableInitialization, ) + # Attempt to convert IDs to Int64 before prescoring. 
We expect this to succeed in production, + # fail when running on public data and fail in some unit tests. + conversion = False + try: + # Complete all three conversions before doing any updates, so if there are any errors the + # updates don't happen. + ratingIds = ratings[c.raterParticipantIdKey].astype(pd.Int64Dtype()) + noteStatusHistoryIds = noteStatusHistory[c.noteAuthorParticipantIdKey].astype(pd.Int64Dtype()) + userEnrollmentIds = userEnrollment[c.participantIdKey].astype(pd.Int64Dtype()) + ratings[c.raterParticipantIdKey] = ratingIds + noteStatusHistory[c.noteAuthorParticipantIdKey] = noteStatusHistoryIds + userEnrollment[c.participantIdKey] = userEnrollmentIds + del ratingIds, noteStatusHistoryIds, userEnrollmentIds + logger.info( + "User IDs for ratings, noteStatusHistory and userEnrollment converted to Int64Dtype." + ) + conversion = True + except ValueError as e: + logger.info(f"Error converting user IDs to ints. IDs will remain as strings. {repr(e)}") + with c.time_block("Logging Prescoring Inputs RAM usage before _run_scorers"): + logger.info(get_df_info(notes, "notes")) + logger.info(get_df_info(ratings, "ratings")) + logger.info(get_df_info(noteStatusHistory, "noteStatusHistory")) + logger.info(get_df_info(userEnrollment, "userEnrollment")) prescoringModelResultsFromAllScorers = _run_scorers( scorers=list(chain(*scorers.values())), scoringArgs=PrescoringArgs( @@ -891,20 +1122,289 @@ def run_prescoring( # scorer (i.e. we would not finish faster with >6 worker processes.) maxWorkers=6, ) - ( prescoringNoteModelOutput, prescoringRaterModelOutput, + prescoringMetaOutput, ) = combine_prescorer_scorer_results(prescoringModelResultsFromAllScorers) + del prescoringModelResultsFromAllScorers + del scorers + gc.collect() + + with c.time_block("Logging Prescoring Results RAM usage (before conversion)"): + logger.info(get_df_info(notes, "notes")) + logger.info(get_df_info(ratings, "ratings")) + logger.info(get_df_info(noteStatusHistory, "noteStatusHistory")) + logger.info(get_df_info(userEnrollment, "userEnrollment")) + logger.info(get_df_info(prescoringNoteModelOutput, "prescoringNoteModelOutput")) + logger.info(get_df_info(prescoringRaterModelOutput, "prescoringRaterModelOutput")) + # Restore IDs as string objects now that prescoring is over and memory pressure is relaxed. + if conversion: + logger.info("Restoring string IDs.") + ratings[c.raterParticipantIdKey] = ratings[c.raterParticipantIdKey].astype(str) + noteStatusHistory[c.noteAuthorParticipantIdKey] = noteStatusHistory[ + c.noteAuthorParticipantIdKey + ].astype(str) + userEnrollment[c.participantIdKey] = userEnrollment[c.participantIdKey].astype(str) + # Notice that we also do conversion on the prescoring results. 
+ prescoringRaterModelOutput[c.raterParticipantIdKey] = prescoringRaterModelOutput[ + c.raterParticipantIdKey + ].astype(str) + logger.info("Restoration of original string IDs complete.") + + with c.time_block("Logging Prescoring Results RAM usage (after conversion)"): + logger.info(get_df_info(notes, "notes")) + logger.info(get_df_info(ratings, "ratings")) + logger.info(get_df_info(noteStatusHistory, "noteStatusHistory")) + logger.info(get_df_info(userEnrollment, "userEnrollment")) + logger.info(get_df_info(prescoringNoteModelOutput, "prescoringNoteModelOutput")) + logger.info(get_df_info(prescoringRaterModelOutput, "prescoringRaterModelOutput")) + + prescoringRaterModelOutput = pd.concat( + [prescoringRaterModelOutput, postSelectionSimilarityValues], + unsafeAllowed={ + c.postSelectionValueKey, + }, + ) + with c.time_block("Logging Prescoring Results RAM usage (after concatenation)"): + logger.info(get_df_info(prescoringRaterModelOutput, "prescoringRaterModelOutput")) + + # Prescoring itself is now done. We will now run final_note_scoring to check note status flips. + if checkFlips: + # Rescore a smaller set of notes, since we are only using these note statuses to check for flips. + # Rescore only unlocked notes. (In the future, we could randomly sample a subset of these) + noteStatusHistoryToRescore = noteStatusHistory[ + noteStatusHistory[c.timestampMillisOfStatusLockKey].isna() + ] + + notesToRescoreSet = set(noteStatusHistoryToRescore[c.noteIdKey]) + ratingsToRescore = ratings[ratings["noteId"].isin(notesToRescoreSet)].copy() + notesToRescore = notes[notes["noteId"].isin(notesToRescoreSet)].copy() + + scoredNotes, _, _, _ = run_final_note_scoring( + notes=notesToRescore, + ratings=ratingsToRescore, + noteStatusHistory=noteStatusHistoryToRescore, + userEnrollment=userEnrollment, + seed=seed, + pseudoraters=pseudoraters, + enabledScorers=enabledScorers, + runParallel=runParallel, + useStableInitialization=useStableInitialization, + prescoringNoteModelOutput=prescoringNoteModelOutput, + prescoringRaterModelOutput=prescoringRaterModelOutput, + noteTopicClassifier=noteTopicClassifierPipe, + prescoringMetaOutput=prescoringMetaOutput, + checkFlips=checkFlips, + enableNmrDueToMinStableCrhTime=enableNmrDueToMinStableCrhTime, + previousRatingCutoffTimestampMillis=previousRatingCutoffTimestampMillis, + ) + else: + scoredNotes = None - return prescoringNoteModelOutput, prescoringRaterModelOutput + return ( + prescoringNoteModelOutput, + prescoringRaterModelOutput, + noteTopicClassifierPipe, + prescoringMetaOutput, + scoredNotes, + ) -def run_final_scoring( +def run_contributor_scoring( + ratings: pd.DataFrame, + scoredNotes: pd.DataFrame, + auxiliaryNoteInfo: pd.DataFrame, + prescoringRaterModelOutput: pd.DataFrame, + noteStatusHistory: pd.DataFrame, + userEnrollment: pd.DataFrame, + strictColumns: bool = True, +) -> pd.DataFrame: + helpfulnessScores = convert_prescoring_rater_model_output_to_coalesced_helpfulness_scores( + prescoringRaterModelOutput, userEnrollment + ) + helpfulnessScores = coalesce_group_model_helpfulness_scores(helpfulnessScores) + helpfulnessScores = coalesce_multi_group_model_helpfulness_scores(helpfulnessScores) + + # Compute contribution statistics and enrollment state for users.
+ with c.time_block("Post-scorers: Compute helpfulness scores"): + helpfulnessScores = _compute_helpfulness_scores( + ratings, + scoredNotes, + auxiliaryNoteInfo, + helpfulnessScores, + noteStatusHistory, + userEnrollment, + ) + if strictColumns: + helpfulnessScores = _validate_contributor_scoring_output(helpfulnessScores) + return helpfulnessScores + + +def determine_which_notes_to_rescore( + notes: pd.DataFrame, + ratings: pd.DataFrame, + noteStatusHistory: pd.DataFrame, + previousRatingCutoffTimestampMillis: Optional[int] = None, + scoreRecentNotesMinimumFrequencyMillis: Optional[int] = 1000 * 60 * 60 * 24, # 1 day + recentNotesAgeCutoffMillis: Optional[int] = 1000 * 60 * 60 * 24 * 14, # 14 days, + scoreRecentlyFlippedNotesMinimumFrequencyMillis: Optional[int] = 1000 * 60 * 60 * 1, # 1 hour + recentlyFlippedNoteAgeCutoffMillis: Optional[int] = 1000 * 60 * 60 * 24, # 1 day +) -> Tuple[List[c.NoteSubset], set]: + notesToRescoreSet = set() + noteSubsets = [] + + # 1. Rescore all notes with a new rating since last scoring run. + if previousRatingCutoffTimestampMillis is not None: + notesWithNewRatings = set( + ratings.loc[ratings[c.createdAtMillisKey] > previousRatingCutoffTimestampMillis, c.noteIdKey] + ) + logger.info( + f"1. Num notes with new ratings since last scoring run (ts: {previousRatingCutoffTimestampMillis}): {len(notesWithNewRatings)}" + ) + notesToRescoreSet.update(notesWithNewRatings) + else: + notesWithNewRatings = set() + noteSubsets.append( + c.NoteSubset( + noteSet=notesWithNewRatings, + maxNewCrhChurnRate=c.finalNotesWithNewRatingsMaxNewCrhChurn, + maxOldCrhChurnRate=c.finalNotesWithNewRatingsMaxOldCrhChurn, + description=c.RescoringRuleID.NOTES_WITH_NEW_RATINGS, + ) + ) + + currentMillis = int(time.time() * 1000) + + # 2. Rescore all recently created notes if not rescored at the minimum frequency. + if recentNotesAgeCutoffMillis is not None and scoreRecentNotesMinimumFrequencyMillis is not None: + noteCreatedRecently = ( + noteStatusHistory[c.createdAtMillisKey] > currentMillis - recentNotesAgeCutoffMillis + ) + noteNotRescoredRecently = ( + noteStatusHistory[c.timestampMillisOfNoteCurrentLabelKey] + < currentMillis - scoreRecentNotesMinimumFrequencyMillis + ) + newNotesNotRescoredRecentlyEnough = set( + noteStatusHistory.loc[noteCreatedRecently & noteNotRescoredRecently, c.noteIdKey] + ) + logger.info("2. Rescore all recently created notes if not rescored at the minimum frequency.") + logger.info(f"Num notes created recently: {noteCreatedRecently.sum()}") + # Remove notes with new ratings from this set. + newNotesNotRescoredRecentlyEnough = newNotesNotRescoredRecentlyEnough.difference( + notesWithNewRatings + ) + notesToRescoreSet.update(newNotesNotRescoredRecentlyEnough) + else: + newNotesNotRescoredRecentlyEnough = set() + noteSubsets.append( + c.NoteSubset( + noteSet=newNotesNotRescoredRecentlyEnough, + maxNewCrhChurnRate=c.finalUnlockedNotesWithNoNewRatingsMaxCrhChurn, + maxOldCrhChurnRate=c.finalUnlockedNotesWithNoNewRatingsMaxCrhChurn, + description=c.RescoringRuleID.NEW_NOTES_NOT_RESCORED_RECENTLY_ENOUGH, + ) + ) + + # 3. Rescore all notes that flipped status in the previous scoring run. + justFlippedNotes = set( + noteStatusHistory.loc[ + ( + noteStatusHistory[c.timestampMillisOfMostRecentStatusChangeKey] + == noteStatusHistory[c.timestampMillisOfNoteCurrentLabelKey] + ), + c.noteIdKey, + ] + ).difference(notesWithNewRatings) + logger.info( + f"3. Rescore all notes that flipped status in the previous scoring run. 
{len(justFlippedNotes)}" + ) + notesToRescoreSet.update(justFlippedNotes) + noteSubsets.append( + c.NoteSubset( + noteSet=justFlippedNotes, + maxNewCrhChurnRate=c.finalNotesThatJustFlippedStatusMaxCrhChurn, + maxOldCrhChurnRate=c.finalNotesThatJustFlippedStatusMaxCrhChurn, + description=c.RescoringRuleID.NOTES_FLIPPED_PREVIOUS_RUN, + ) + ) + + # 4. Rescore all recently-flipped notes if not rescored at the minimum frequency. + if ( + recentlyFlippedNoteAgeCutoffMillis is not None + and scoreRecentlyFlippedNotesMinimumFrequencyMillis is not None + ): + noteFlippedRecently = ( + noteStatusHistory[c.timestampMillisOfMostRecentStatusChangeKey] + > currentMillis - recentlyFlippedNoteAgeCutoffMillis + ) + noteNotRescoredRecently = ( + noteStatusHistory[c.timestampMillisOfNoteCurrentLabelKey] + < currentMillis - scoreRecentlyFlippedNotesMinimumFrequencyMillis + ) + logger.info("4. Rescore all recently-flipped notes if not rescored at the minimum frequency.") + logger.info(f"Num notes flipped recently: {noteFlippedRecently.sum()}") + logger.info(f"Num notes not rescored recently enough: {noteNotRescoredRecently.sum()}") + recentlyFlippedNotesNotRescoredRecentlyEnough = set( + noteStatusHistory.loc[noteFlippedRecently & noteNotRescoredRecently, c.noteIdKey] + ) + notesToRescoreSet.update(recentlyFlippedNotesNotRescoredRecentlyEnough) + else: + recentlyFlippedNotesNotRescoredRecentlyEnough = set() + noteSubsets.append( + c.NoteSubset( + noteSet=recentlyFlippedNotesNotRescoredRecentlyEnough, + maxNewCrhChurnRate=c.finalNotesThatFlippedRecentlyMaxCrhChurn, + maxOldCrhChurnRate=c.finalNotesThatFlippedRecentlyMaxCrhChurn, + description=c.RescoringRuleID.RECENTLY_FLIPPED_NOTES_NOT_RESCORED_RECENTLY_ENOUGH, + ) + ) + + # 5. Rescore all notes that were NMRed due to MinStableCrhTime was not met. + nmrDueToMinStableCrhTimeNotes = set( + noteStatusHistory.loc[ + ( + ~noteStatusHistory[c.timestampMillisOfNmrDueToMinStableCrhTimeKey].isna() + & (noteStatusHistory[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] > 0) + ), + c.noteIdKey, + ] + ) + logger.info( + f"5. Rescore all notes that were NMRed due to MinStableCrhTime was not met. {len(nmrDueToMinStableCrhTimeNotes)}" + ) + notesToRescoreSet.update(nmrDueToMinStableCrhTimeNotes) + noteSubsets.append( + c.NoteSubset( + noteSet=nmrDueToMinStableCrhTimeNotes, + maxNewCrhChurnRate=c.finalNotesNmrDueToMinStableCrhTimeMaxNewCrhChurn, + maxOldCrhChurnRate=c.finalNotesNmrDueToMinStableCrhTimeMaxOldCrhChurn, + description=c.RescoringRuleID.NMR_DUE_TO_MIN_STABLE_CRH_TIME, + ) + ) + + logger.info( + f"""----\nNotes to rescore: + * {len(notesWithNewRatings)} notes with new ratings since last scoring run. + * {len(newNotesNotRescoredRecentlyEnough)} notes created recently and not rescored recently enough. + * {len(justFlippedNotes)} notes that flipped status in the previous scoring run. + * {len(recentlyFlippedNotesNotRescoredRecentlyEnough)} notes that flipped status recently and not rescored recently enough. + * {len(nmrDueToMinStableCrhTimeNotes)} notes that were NMRed due to MinStableCrhTime was not met. 
+      Overall: {len(notesToRescoreSet)} notes to rescore, out of {len(notes)} total.\n----"""
+  )
+
+  return noteSubsets, notesToRescoreSet
+
+
+def run_final_note_scoring(
   notes: pd.DataFrame,
   ratings: pd.DataFrame,
   noteStatusHistory: pd.DataFrame,
   userEnrollment: pd.DataFrame,
+  prescoringNoteModelOutput: pd.DataFrame,
+  prescoringRaterModelOutput: pd.DataFrame,
+  noteTopicClassifier: sklearn.pipeline.Pipeline,
+  prescoringMetaOutput: c.PrescoringMetaOutput,
   seed: Optional[int] = None,
   pseudoraters: Optional[bool] = True,
   enabledScorers: Optional[Set[Scorers]] = None,
@@ -912,16 +1412,123 @@ def run_final_scoring(
   runParallel: bool = True,
   dataLoader: Optional[CommunityNotesDataLoader] = None,
   useStableInitialization: bool = True,
-  prescoringNoteModelOutput: Optional[pd.DataFrame] = None,
-  prescoringRaterModelOutput: Optional[pd.DataFrame] = None,
+  checkFlips: bool = True,
+  previousScoredNotes: Optional[pd.DataFrame] = None,
+  previousAuxiliaryNoteInfo: Optional[pd.DataFrame] = None,
+  previousRatingCutoffTimestampMillis: Optional[int] = 0,
+  enableNmrDueToMinStableCrhTime: bool = True,
 ):
+  metrics = {}
+  with c.time_block("Logging Final Scoring RAM usage"):
+    logger.info(get_df_info(notes, "notes"))
+    logger.info(get_df_info(ratings, "ratings"))
+    logger.info(get_df_info(noteStatusHistory, "noteStatusHistory"))
+    logger.info(get_df_info(userEnrollment, "userEnrollment"))
+    logger.info(get_df_info(prescoringNoteModelOutput, "prescoringNoteModelOutput"))
+    logger.info(get_df_info(prescoringRaterModelOutput, "prescoringRaterModelOutput"))
+  with c.time_block("Determine which notes to score."):
+    if previousScoredNotes is None:
+      logger.info("No previous scored notes passed; scoring all notes.")
+      notesToRescoreSet: Set[int] = set()
+      scoredNotesPassthrough = None
+      currentMillis = int(time.time() * 1000)
+      recentNotesAgeTooOldCutoffMillis = (
+        1000 * 60 * 60 * 24 * 13
+      )  # 13 days: one less than final scoring to avoid boundary issues
+      recentNotesAgeTooRecentCutoffMillis = (
+        1000 * 60 * 60 * 24 * 3
+      )  # 3 days, to avoid notes with too many new ratings
+
+      noteSubsets: List[c.NoteSubset] = [
+        c.NoteSubset(
+          noteSet=None,
+          maxNewCrhChurnRate=c.prescoringAllUnlockedNotesMaxCrhChurn,
+          maxOldCrhChurnRate=c.prescoringAllUnlockedNotesMaxCrhChurn,
+          description=c.RescoringRuleID.ALL_NOTES,
+        ),
+        c.NoteSubset(
+          noteSet=set(
+            noteStatusHistory.loc[
+              (
+                (
+                  noteStatusHistory[c.createdAtMillisKey]
+                  >= currentMillis - recentNotesAgeTooOldCutoffMillis
+                )
+                & (
+                  noteStatusHistory[c.createdAtMillisKey]
+                  < currentMillis - recentNotesAgeTooRecentCutoffMillis
+                )
+              ),
+              c.noteIdKey,
+            ]
+          ),
+          maxNewCrhChurnRate=c.prescoringAllNotesCreatedThreeToThirteenDaysAgoMaxChurn,
+          maxOldCrhChurnRate=c.prescoringAllNotesCreatedThreeToThirteenDaysAgoMaxChurn,
+          description=c.RescoringRuleID.NOTES_CREATED_SOMEWHAT_RECENTLY,
+        ),
+      ]
+
+      noteSubsetsForProdScoring, _ = determine_which_notes_to_rescore(
+        notes, ratings, noteStatusHistory, previousRatingCutoffTimestampMillis
+      )
+      for noteSubset in noteSubsetsForProdScoring:
+        if noteSubset.description == c.RescoringRuleID.NEW_NOTES_NOT_RESCORED_RECENTLY_ENOUGH:
+          noteSubsets.append(noteSubset)
+    else:
+      assert previousAuxiliaryNoteInfo is not None
+      assert previousRatingCutoffTimestampMillis is not None
+      logger.info("Previous scored notes passed; determining which notes to rescore.")
+      # Filter all datasets to smaller versions which only contain notes which need to be scored.
+      noteSubsets, notesToRescoreSet = determine_which_notes_to_rescore(
+        notes, ratings, noteStatusHistory, previousRatingCutoffTimestampMillis
+      )
+
+      scoredNotesPassthrough = previousScoredNotes[
+        ~previousScoredNotes[c.noteIdKey].isin(notesToRescoreSet)
+      ]
+      auxiliaryNoteInfoPassthrough = previousAuxiliaryNoteInfo[
+        ~previousAuxiliaryNoteInfo[c.noteIdKey].isin(notesToRescoreSet)
+      ]
+      noteStatusHistoryPassthrough = noteStatusHistory[
+        ~noteStatusHistory[c.noteIdKey].isin(notesToRescoreSet)
+      ]
+
+      logger.info(
+        f"Rescoring {len(notesToRescoreSet)} notes, out of {len(notes)} total. Original number of ratings: {len(ratings)}"
+      )
+      metrics["num_notes_to_rescore"] = len(notesToRescoreSet)
+
+      # Filter all datasets to only contain notes which need to be scored.
+      notes = notes[notes[c.noteIdKey].isin(notesToRescoreSet)]
+      ratings = ratings[ratings[c.noteIdKey].isin(notesToRescoreSet)]
+      noteStatusHistory = noteStatusHistory[noteStatusHistory[c.noteIdKey].isin(notesToRescoreSet)]
+      prescoringNoteModelOutput = prescoringNoteModelOutput[
+        prescoringNoteModelOutput[c.noteIdKey].isin(notesToRescoreSet)
+      ]
+
+      logger.info(f"Ratings on notes to rescore: {len(ratings)}")
+      metrics["num_ratings_on_notes_to_rescore"] = len(ratings)
+      metrics["latest_rating_created_ms"] = ratings["createdAtMillis"].max()
+
+      with c.time_block("Preprocess smaller dataset since we skipped preprocessing at read time"):
+        notes, ratings, noteStatusHistory = preprocess_data(notes, ratings, noteStatusHistory)
+
   with c.time_block("Note Topic Assignment"):
     topicModel = TopicModel()
-    noteTopics = topicModel.get_note_topics(notes)
+    noteTopics = topicModel.get_note_topics(notes, noteTopicClassifier)

-  scorers = _get_scorers(
-    seed, pseudoraters, enabledScorers, useStableInitialization=useStableInitialization
-  )
+  with c.time_block("Post Selection Similarity: Final Scoring"):
+    logger.info(f"Post Selection Similarity Final Scoring: begin with {len(ratings)} ratings.")
+    ratings = filter_ratings_by_post_selection_similarity(
+      notes,
+      ratings,
+      prescoringRaterModelOutput[prescoringRaterModelOutput[c.postSelectionValueKey] >= 1][
+        [c.raterParticipantIdKey, c.postSelectionValueKey]
+      ],
+    )
+    logger.info(f"Post Selection Similarity Final Scoring: {len(ratings)} ratings remaining.")
+
+  scorers = _get_scorers(seed, pseudoraters, useStableInitialization=useStableInitialization)

   modelResults = _run_scorers(
     scorers=list(chain(*scorers.values())),
@@ -932,6 +1539,7 @@ def run_final_scoring(
       userEnrollment,
       prescoringNoteModelOutput=prescoringNoteModelOutput,
       prescoringRaterModelOutput=prescoringRaterModelOutput,
+      prescoringMetaOutput=prescoringMetaOutput,
     ),
     runParallel=runParallel,
     dataLoader=dataLoader,
@@ -941,33 +1549,62 @@ def run_final_scoring(
     maxWorkers=6,
   )

-  scoredNotes, helpfulnessScores, auxiliaryNoteInfo = combine_final_scorer_results(
-    modelResults, noteStatusHistory
-  )
+  scoredNotes, auxiliaryNoteInfo = combine_final_scorer_results(modelResults, noteStatusHistory)

-  return post_scoring(
+  scoredNotes, newNoteStatusHistory, auxiliaryNoteInfo = post_note_scoring(
     scorers,
     scoredNotes,
-    helpfulnessScores,
     auxiliaryNoteInfo,
     ratings,
     noteStatusHistory,
-    userEnrollment,
+    noteSubsets,
     enabledScorers,
     strictColumns,
+    checkFlips,
+    enableNmrDueToMinStableCrhTime,
   )

+  # Concat final scoring results for newly-scored notes with the results for old notes that were not rescored.
+ if scoredNotesPassthrough is not None: + # Convert scoredNotes dtypes to match scoredNotesPassthrough + for column, targetDtype in c.noteModelOutputTSVTypeMapping.items(): + if column in scoredNotes.columns: + if targetDtype == pd.BooleanDtype(): + # Due to current Python version in prod, we cannot interpret pd.BooleanDtype() as a datatype yet. + continue + if scoredNotes[column].dtype != targetDtype: + scoredNotes[column] = scoredNotes[column].astype(targetDtype) + scoredNotesPassthrough[c.rescoringActiveRulesKey] = "" + scoredNotes = pd.concat( + [scoredNotes, scoredNotesPassthrough], + unsafeAllowed=[c.topicNoteConfidentKey], # concat 'O' with BooleanDtype + ) + + # Convert auxiliaryNoteInfo dtypes to match auxiliaryNoteInfoPassthrough + for column, targetDtype in c.auxiliaryScoredNotesTSVTypeMapping.items(): + if column in auxiliaryNoteInfo.columns: + if auxiliaryNoteInfo[column].dtype != targetDtype: + auxiliaryNoteInfo[column] = auxiliaryNoteInfo[column].astype(targetDtype) + auxiliaryNoteInfo = pd.concat( + [auxiliaryNoteInfo, auxiliaryNoteInfoPassthrough], + ) + + newNoteStatusHistory = pd.concat([newNoteStatusHistory, noteStatusHistoryPassthrough]) + + return scoredNotes, newNoteStatusHistory, auxiliaryNoteInfo, metrics -def post_scoring( + +def post_note_scoring( scorers: Dict[Scorers, List[Scorer]], scoredNotes: pd.DataFrame, - helpfulnessScores: pd.DataFrame, auxiliaryNoteInfo: pd.DataFrame, ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, - userEnrollment: pd.DataFrame, + noteSubsetsAndMaxFlipRates: List[c.NoteSubset], enabledScorers: Optional[Set[Scorers]] = None, strictColumns: bool = True, + checkFlips: bool = True, + enableNmrDueToMinStableCrhTime: bool = True, ): """ Apply individual scoring models and obtained merged result. @@ -988,8 +1625,11 @@ def post_scoring( scorers, scoredNotes, auxiliaryNoteInfo, - noteStatusHistory[[c.noteIdKey, c.lockedStatusKey]], + noteStatusHistory[ + [c.noteIdKey, c.lockedStatusKey, c.timestampMillisOfNmrDueToMinStableCrhTimeKey] + ], enabledScorers, + enableNmrDueToMinStableCrhTime, ) with c.time_block("Post-scorers: Join scored notes"): @@ -1005,17 +1645,31 @@ def post_scoring( noteStatusHistory ), "noteStatusHistory should be complete, and all notes should be scored." - # Compute contribution statistics and enrollment state for users. - with c.time_block("Post-scorers: Compute helpfulness scores"): - helpfulnessScores = _compute_helpfulness_scores( - ratings, scoredNotes, auxiliaryNoteInfo, helpfulnessScores, noteStatusHistory, userEnrollment - ) - - # Merge scoring results into noteStatusHistory. + # Merge scoring results into noteStatusHistory, check flip rates, and set rescoringActiveRules. with c.time_block("Post-scorers: Update note status history"): - newNoteStatusHistory = note_status_history.update_note_status_history( + mergedNoteStatuses = note_status_history.merge_old_and_new_note_statuses( noteStatusHistory, scoredNotes ) + # Not needed anymore, has been merged into note_status_history. + scoredNotes = scoredNotes.drop(columns=[c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey]) + + scoredNotes[c.rescoringActiveRulesKey] = "" + for noteSubset in noteSubsetsAndMaxFlipRates: + if checkFlips: + note_status_history.check_flips(mergedNoteStatuses, noteSubset=noteSubset) + if noteSubset.noteSet is not None: + noteInSetMask = scoredNotes[c.noteIdKey].isin(noteSubset.noteSet) + else: + noteInSetMask = scoredNotes[c.noteIdKey].notnull() # All notes by default. 
+ scoredNotes.loc[noteInSetMask, c.rescoringActiveRulesKey] = scoredNotes.loc[ + noteInSetMask, c.rescoringActiveRulesKey + ].apply( + lambda rescoringActiveRules: rescoringActiveRules + noteSubset.description.name + if len(rescoringActiveRules) == 0 + else f"{rescoringActiveRules},{noteSubset.description.name}" + ) + + newNoteStatusHistory = note_status_history.update_note_status_history(mergedNoteStatuses) assert len(newNoteStatusHistory) == len( noteStatusHistory ), "noteStatusHistory should contain all notes after preprocessing" @@ -1024,12 +1678,14 @@ def post_scoring( with c.time_block("Post-scorers: finalize output columns"): scoredNotes = _add_deprecated_columns(scoredNotes) if strictColumns: - scoredNotes, helpfulnessScores, newNoteStatusHistory, auxiliaryNoteInfo = _validate( - scoredNotes, helpfulnessScores, newNoteStatusHistory, auxiliaryNoteInfo + (scoredNotes, newNoteStatusHistory, auxiliaryNoteInfo) = _validate_note_scoring_output( + scoredNotes, newNoteStatusHistory, auxiliaryNoteInfo ) - print(f"Meta scoring elapsed time: {((time.time() - postScoringStartTime)/60.0):.2f} minutes.") - return scoredNotes, helpfulnessScores, newNoteStatusHistory, auxiliaryNoteInfo + logger.info( + f"Meta scoring elapsed time: {((time.time() - postScoringStartTime)/60.0):.2f} minutes." + ) + return scoredNotes, newNoteStatusHistory, auxiliaryNoteInfo def run_scoring( @@ -1045,14 +1701,30 @@ def run_scoring( dataLoader: Optional[CommunityNotesDataLoader] = None, useStableInitialization: bool = True, writePrescoringScoringOutputCallback: Optional[ - Callable[[pd.DataFrame, pd.DataFrame], None] + Callable[ + [ + pd.DataFrame, + pd.DataFrame, + sklearn.pipeline.Pipeline, + c.PrescoringMetaOutput, + Optional[pd.DataFrame], + ], + None, + ] ] = None, + cutoffTimestampMillis: Optional[int] = None, + excludeRatingsAfterANoteGotFirstStatusPlusNHours: Optional[int] = None, + daysInPastToApplyPostFirstStatusFiltering: Optional[int] = 14, filterPrescoringInputToSimulateDelayInHours: Optional[int] = None, + checkFlips: bool = True, + previousScoredNotes: Optional[pd.DataFrame] = None, + previousAuxiliaryNoteInfo: Optional[pd.DataFrame] = None, + previousRatingCutoffTimestampMillis: Optional[int] = 0, ): """Runs both phases of scoring consecutively. Only for adhoc/testing use. In prod, we run each phase as a separate binary. - Wrapper around run_prescoring and run_final_scoring. + Wrapper around run_prescoring, run_final_note_scoring, and run_contributor_scoring. Invokes note scoring algorithms, merges results and computes user stats. @@ -1077,29 +1749,28 @@ def run_scoring( noteStatusHistory pd.DataFrame: one row per note containing when they got their most recent statuses. auxiliaryNoteInfo: one row per note containing adjusted and ratio tag values """ - - # Filter input data for prescoring to simulate running prescoring earlier than final scoring - if filterPrescoringInputToSimulateDelayInHours is not None: - latestRatingMillis = ratings[c.createdAtMillisKey].max() - cutoffMillis = latestRatingMillis - ( - filterPrescoringInputToSimulateDelayInHours * 60 * 60 * 1000 - ) - print( - f""" - Filtering input data for prescoring to simulate running prescoring earlier than final scoring. 
- Latest rating timestamp: {pd.to_datetime(latestRatingMillis, unit='ms')} - Cutoff timestamp: {pd.to_datetime(cutoffMillis, unit='ms')} ({filterPrescoringInputToSimulateDelayInHours} hours before) - """ - ) - prescoringNotesInput = notes[notes[c.createdAtMillisKey] < cutoffMillis].copy() - prescoringRatingsInput = ratings[ratings[c.createdAtMillisKey] < cutoffMillis].copy() - else: - prescoringNotesInput = notes - prescoringRatingsInput = ratings + # Filter input data for testing if optional args present. Else, do nothing. + ( + notes, + ratings, + prescoringNotesInput, + prescoringRatingsInput, + ) = filter_input_data_for_testing( + notes, + ratings, + noteStatusHistory, + cutoffTimestampMillis, + excludeRatingsAfterANoteGotFirstStatusPlusNHours, + daysInPastToApplyPostFirstStatusFiltering, + filterPrescoringInputToSimulateDelayInHours, + ) ( prescoringNoteModelOutput, prescoringRaterModelOutput, + prescoringNoteTopicClassifier, + prescoringMetaOutput, + prescoringScoredNotes, ) = run_prescoring( notes=prescoringNotesInput, ratings=prescoringRatingsInput, @@ -1110,15 +1781,23 @@ def run_scoring( runParallel=runParallel, dataLoader=dataLoader, useStableInitialization=useStableInitialization, + checkFlips=False, + previousRatingCutoffTimestampMillis=previousRatingCutoffTimestampMillis, ) - print("We invoked run_scoring and are now in between prescoring and scoring.") + logger.info("We invoked run_scoring and are now in between prescoring and scoring.") if writePrescoringScoringOutputCallback is not None: with c.time_block("Writing prescoring output."): - writePrescoringScoringOutputCallback(prescoringNoteModelOutput, prescoringRaterModelOutput) - print("Starting final scoring") + writePrescoringScoringOutputCallback( + prescoringNoteModelOutput, + prescoringRaterModelOutput, + prescoringNoteTopicClassifier, + prescoringMetaOutput, + prescoringScoredNotes, + ) + logger.info("Starting final scoring") - return run_final_scoring( + scoredNotes, newNoteStatusHistory, auxiliaryNoteInfo, _ = run_final_note_scoring( notes=notes, ratings=ratings, noteStatusHistory=noteStatusHistory, @@ -1132,4 +1811,24 @@ def run_scoring( useStableInitialization=useStableInitialization, prescoringNoteModelOutput=prescoringNoteModelOutput, prescoringRaterModelOutput=prescoringRaterModelOutput, + noteTopicClassifier=prescoringNoteTopicClassifier, + prescoringMetaOutput=prescoringMetaOutput, + checkFlips=checkFlips, + previousScoredNotes=previousScoredNotes, + previousAuxiliaryNoteInfo=previousAuxiliaryNoteInfo, + previousRatingCutoffTimestampMillis=previousRatingCutoffTimestampMillis, + ) + + logger.info("Starting contributor scoring") + + helpfulnessScores = run_contributor_scoring( + ratings=ratings, + scoredNotes=scoredNotes, + auxiliaryNoteInfo=auxiliaryNoteInfo, + prescoringRaterModelOutput=prescoringRaterModelOutput, + noteStatusHistory=newNoteStatusHistory, + userEnrollment=userEnrollment, + strictColumns=strictColumns, ) + + return scoredNotes, helpfulnessScores, newNoteStatusHistory, auxiliaryNoteInfo diff --git a/sourcecode/scoring/runner.py b/sourcecode/scoring/runner.py index 48038015..f37d7488 100644 --- a/sourcecode/scoring/runner.py +++ b/sourcecode/scoring/runner.py @@ -1,19 +1,49 @@ import argparse +import logging import os +import sys from . 
import constants as c from .enums import scorers_from_csv -from .process_data import ( - LocalDataLoader, - write_parquet_local, - write_prescoring_output, - write_tsv_local, -) +from .pandas_utils import patch_pandas +from .process_data import LocalDataLoader, tsv_reader, write_parquet_local, write_tsv_local from .run_scoring import run_scoring +import pandas as pd + + +logger = logging.getLogger("birdwatch.runner") +logger.setLevel(logging.INFO) + def parse_args(): parser = argparse.ArgumentParser("Community Notes Scoring") + parser.add_argument( + "--check-flips", + dest="check_flips", + help="Validate that note statuses align with prior runs (disable for testing)", + action="store_true", + ) + parser.add_argument( + "--nocheck-flips", + help="Disable validation that note statuses align with prior runs (use for testing)", + action="store_false", + dest="check_flips", + ) + parser.set_defaults(check_flips=True) + parser.add_argument( + "--enforce-types", + dest="enforce_types", + help="Raise errors when types in Pandas operations do not meet expectations.", + action="store_true", + ) + parser.add_argument( + "--noenforce-types", + dest="enforce_types", + help="Log to stderr when types in Pandas operations do not meet expectations.", + action="store_false", + ) + parser.set_defaults(enforce_types=True) parser.add_argument( "-e", "--enrollment", default=c.enrollmentInputPath, help="note enrollment dataset" ) @@ -38,6 +68,15 @@ def parse_args(): ) parser.set_defaults(headers=True) parser.add_argument("-n", "--notes", default=c.notesInputPath, help="note dataset") + parser.add_argument( + "--previous-scored-notes", default=None, help="previous scored notes dataset path" + ) + parser.add_argument( + "--previous-aux-note-info", default=None, help="previous aux note info dataset path" + ) + parser.add_argument( + "--previous-rating-cutoff-millis", default=None, type=int, help="previous rating cutoff millis" + ) parser.add_argument("-o", "--outdir", default=".", help="directory for output files") parser.add_argument( "--pseudoraters", @@ -82,13 +121,7 @@ def parse_args(): dest="parallel", ) parser.set_defaults(parallel=False) - parser.add_argument( - "--prescoring-delay-hours", - default=None, - type=int, - dest="prescoring_delay_hours", - help="Filter prescoring input to simulate delay in hours", - ) + parser.add_argument( "--no-parquet", help="Disable writing parquet files.", @@ -97,36 +130,80 @@ def parse_args(): dest="no_parquet", ) + parser.add_argument( + "--cutoff-timestamp-millis", + default=None, + type=int, + dest="cutoffTimestampMillis", + help="filter notes and ratings created after this time.", + ) + parser.add_argument( + "--exclude-ratings-after-a-note-got-first-status-plus-n-hours", + default=None, + type=int, + dest="excludeRatingsAfterANoteGotFirstStatusPlusNHours", + help="Exclude ratings after a note got first status plus n hours", + ) + parser.add_argument( + "--days-in-past-to-apply-post-first-status-filtering", + default=14, + type=int, + dest="daysInPastToApplyPostFirstStatusFiltering", + help="Days in past to apply post first status filtering", + ) + parser.add_argument( + "--prescoring-delay-hours", + default=None, + type=int, + dest="prescoring_delay_hours", + help="Filter prescoring input to simulate delay in hours", + ) + return parser.parse_args() -def main(): - # Parse arguments and fix timestamp, if applicable. 
- args = parse_args() +@patch_pandas +def _run_scorer( + args=None, + dataLoader=None, + extraScoringArgs={}, +): + assert args is not None, "args must be available" if args.epoch_millis: c.epochMillis = args.epoch_millis c.useCurrentTimeInsteadOfEpochMillisForNoteStatusHistory = False # Load input dataframes. - dataLoader = LocalDataLoader( - args.notes, - args.ratings, - args.status, - args.enrollment, - args.headers, - prescoringNoteModelOutputPath=os.path.join(args.outdir, "prescoring_scored_notes.tsv"), - prescoringRaterModelOutputPath=os.path.join(args.outdir, "prescoring_helpfulness_scores.tsv"), - ) + if dataLoader is None: + dataLoader = LocalDataLoader( + args.notes, + args.ratings, + args.status, + args.enrollment, + args.headers, + ) notes, ratings, statusHistory, userEnrollment = dataLoader.get_data() - - # Prepare callback to write first round scoring output - def prescoring_write_fn(notePath, raterPath): - return write_prescoring_output( - notePath, - raterPath, - os.path.join(args.outdir, "prescoring_scored_notes.tsv"), - os.path.join(args.outdir, "prescoring_helpfulness_scores.tsv"), + if args.previous_scored_notes is not None: + previousScoredNotes = tsv_reader( + args.previous_scored_notes, + c.noteModelOutputTSVTypeMapping, + c.noteModelOutputTSVColumns, + header=False, + convertNAToNone=False, ) + assert ( + args.previous_aux_note_info is not None + ), "previous_aux_note_info must be available if previous_scored_notes is available" + previousAuxiliaryNoteInfo = tsv_reader( + args.previous_aux_note_info, + c.auxiliaryScoredNotesTSVTypeMapping, + c.auxiliaryScoredNotesTSVColumns, + header=False, + convertNAToNone=False, + ) + else: + previousScoredNotes = None + previousAuxiliaryNoteInfo = None # Invoke scoring and user contribution algorithms. scoredNotes, helpfulnessScores, newStatus, auxNoteInfo = run_scoring( @@ -140,8 +217,15 @@ def prescoring_write_fn(notePath, raterPath): strictColumns=args.strict_columns, runParallel=args.parallel, dataLoader=dataLoader if args.parallel == True else None, - writePrescoringScoringOutputCallback=prescoring_write_fn, + cutoffTimestampMillis=args.cutoffTimestampMillis, + excludeRatingsAfterANoteGotFirstStatusPlusNHours=args.excludeRatingsAfterANoteGotFirstStatusPlusNHours, + daysInPastToApplyPostFirstStatusFiltering=args.daysInPastToApplyPostFirstStatusFiltering, filterPrescoringInputToSimulateDelayInHours=args.prescoring_delay_hours, + checkFlips=args.check_flips, + previousScoredNotes=previousScoredNotes, + previousAuxiliaryNoteInfo=previousAuxiliaryNoteInfo, + previousRatingCutoffTimestampMillis=args.previous_rating_cutoff_millis, + **extraScoringArgs, ) # Write outputs to local disk. @@ -157,5 +241,19 @@ def prescoring_write_fn(notePath, raterPath): write_parquet_local(auxNoteInfo, os.path.join(args.outdir, "aux_note_info.parquet")) +def main( + args=None, + dataLoader=None, + extraScoringArgs={}, +): + if args is None: + args = parse_args() + logger.info(f"scorer python version: {sys.version}") + logger.info(f"scorer pandas version: {pd.__version__}") + # patch_pandas requires that args are available (which matches the production binary) so + # we first parse the arguments then invoke the decorated _run_scorer. 
+  return _run_scorer(args=args, dataLoader=dataLoader, extraScoringArgs=extraScoringArgs)
+
+
 if __name__ == "__main__":
   main()
diff --git a/sourcecode/scoring/scorer.py b/sourcecode/scoring/scorer.py
index 8a8d5787..39f980dc 100644
--- a/sourcecode/scoring/scorer.py
+++ b/sourcecode/scoring/scorer.py
@@ -1,16 +1,29 @@
 from abc import ABC, abstractmethod
 from contextlib import contextmanager
+import gc
+import logging
 import time
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Optional, Set, Tuple

 from . import constants as c
 from .constants import FinalScoringArgs, ModelResult, PrescoringArgs
+from .pandas_utils import keep_columns

 import numpy as np
 import pandas as pd
 import torch


+logger = logging.getLogger("birdwatch.scorer")
+logger.setLevel(logging.INFO)
+
+_IN_GROUP = "inGroup"
+
+
+class EmptyRatingException(Exception):
+  """Exception raised when no ratings are available"""
+
+
 class Scorer(ABC):
   """Base class which all other scorers must extend.
@@ -20,12 +33,24 @@ class Scorer(ABC):
   exactly which columns are output and which are dropped.
   """

-  def __init__(self, seed: Optional[int] = None, threads: int = c.defaultNumThreads) -> None:
+  def __init__(
+    self,
+    includedTopics: Set[str] = set(),
+    includedGroups: Set[int] = set(),
+    includeUnassigned: bool = False,
+    captureThreshold: Optional[float] = None,
+    seed: Optional[int] = None,
+    threads: int = c.defaultNumThreads,
+  ) -> None:
     """Configure a new Scorer object.

     Args:
       seed (int, optional): if not None, seed value to ensure deterministic execution
     """
+    self._includedTopics = includedTopics
+    self._includedGroups = includedGroups
+    self._includeUnassigned = includeUnassigned
+    self._captureThreshold = captureThreshold
     self._seed = seed
     self._threads = threads
@@ -36,7 +61,7 @@ def time_block(self, label):
       yield
     finally:
       end = time.time()
-      print(
+      logger.info(
         f"{self.get_name()} {label} elapsed time: {end - start:.2f} secs ({((end-start)/60.0):.2f} mins)"
       )
@@ -90,6 +115,30 @@ def _filter_input(
       ratings: ratings filtered to only contain rows of interest
       noteStatusHistory: noteStatusHistory filtered to only contain rows of interest
     """
+    if (not self._includedGroups) and (not self._includedTopics):
+      return ratings, noteStatusHistory
+    logger.info(f"Filtering ratings for {self.get_name()}. 
Original rating length: {len(ratings)}") + # Apply topic filter + if self._includedTopics: + notes = noteTopics[noteTopics[c.noteTopicKey].isin(self._includedTopics)][[c.noteIdKey]] + ratings = ratings.merge(notes) + noteStatusHistory = noteStatusHistory.merge(notes) + logger.info(f" Ratings after topic filter: {len(ratings)}") + # Apply group filter + if self._includedGroups: + userEnrollment = userEnrollment[[c.participantIdKey, c.modelingGroupKey]].rename( + columns={c.participantIdKey: c.raterParticipantIdKey} + ) + userEnrollment.loc[:, _IN_GROUP] = ( + userEnrollment[c.modelingGroupKey].isin(self._includedGroups).astype(pd.BooleanDtype()) + ) + ratings = ratings.merge( + userEnrollment[[c.raterParticipantIdKey, _IN_GROUP]], on=c.raterParticipantIdKey, how="left" + ) + logger.info(f" Ratings without assigned group: {ratings[_IN_GROUP].isna().sum()}") + ratings = ratings.fillna({_IN_GROUP: self._includeUnassigned}) + ratings = ratings[ratings[_IN_GROUP]].drop(columns=[_IN_GROUP]) + logger.info(f" Ratings after group filter: {len(ratings)}") return ratings, noteStatusHistory def _postprocess_output( @@ -119,11 +168,33 @@ def _postprocess_output( noteScores: note scoring output from _score_notes_and_users userScores: user scoring output from _score_notes_and_users """ + if self._captureThreshold is None: + logger.info(f"Skipping postprocessing for {self.get_name()}: captureThreshold is None.") + return noteScores, userScores + # Identify notes with enough ratings from within the modeling group. + logger.info(f"Postprocessing output for {self.get_name()}") + assert self._includedGroups, "includedGroups must be set" + userEnrollment = userEnrollment[[c.participantIdKey, c.modelingGroupKey]].rename( + columns={c.participantIdKey: c.raterParticipantIdKey} + ) + userEnrollment.loc[:, _IN_GROUP] = ( + userEnrollment[c.modelingGroupKey].isin(self._includedGroups).astype(pd.BooleanDtype()) + ) + ratings = ratings.merge( + userEnrollment[[c.raterParticipantIdKey, _IN_GROUP]], on=c.raterParticipantIdKey, how="left" + ) + ratings = ratings.fillna({_IN_GROUP: self._includeUnassigned}) + ratios = ratings[[c.noteIdKey, _IN_GROUP]].groupby(c.noteIdKey).mean().reset_index() + logger.info(f" Original noteScores length: {len(noteScores)}") + noteScores = noteScores.merge( + ratios[ratios[_IN_GROUP] >= self._captureThreshold][[c.noteIdKey]] + ) + logger.info(f" Final noteScores length: {len(noteScores)}") return noteScores, userScores def _get_note_col_mapping(self) -> Dict[str, str]: """Returns a dict mapping default note column names to custom names for a specific model.""" - return {} + return {c.lowDiligenceNoteInterceptKey: c.lowDiligenceLegacyNoteInterceptKey} def _get_user_col_mapping(self) -> Dict[str, str]: """Returns a dict mapping default user column names to custom names for a specific model.""" @@ -132,7 +203,7 @@ def _get_user_col_mapping(self) -> Dict[str, str]: @abstractmethod def _prescore_notes_and_users( self, ratings: pd.DataFrame, noteStatusHistory: pd.DataFrame, userEnrollmentRaw: pd.DataFrame - ) -> Tuple[pd.DataFrame, pd.DataFrame]: + ) -> Tuple[pd.DataFrame, pd.DataFrame, c.PrescoringMetaScorerOutput]: """ Runs initial rounds of the matrix factorization scoring algorithm and returns intermediate output that can be used to initialize and reduce the runtime of final scoring. 
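# Illustrative sketch (not part of the change above): the group-based rating filter that the
# new Scorer._filter_input applies when includedGroups is set, shown standalone so the
# left-join / fillna / boolean-filter pattern is easy to follow. Column names and data below
# are made up for illustration; the real code uses the constants in constants.py and
# pd.BooleanDtype() rather than a plain bool cast.
import pandas as pd

ratings = pd.DataFrame(
  {"noteId": [1, 1, 2], "raterParticipantId": ["A", "B", "C"], "helpfulNum": [1.0, 0.0, 1.0]}
)
userEnrollment = pd.DataFrame({"raterParticipantId": ["A", "B"], "modelingGroup": [3, 7]})
includedGroups = {3}
includeUnassigned = False  # rater "C" has no enrollment row, so their rating is dropped

# Flag each enrolled rater as in/out of the included modeling groups, then left-join onto
# ratings so raters without an enrollment row get NaN, which fillna replaces with
# includeUnassigned before filtering.
userEnrollment["inGroup"] = userEnrollment["modelingGroup"].isin(includedGroups)
filtered = ratings.merge(
  userEnrollment[["raterParticipantId", "inGroup"]], on="raterParticipantId", how="left"
)
filtered = filtered.fillna({"inGroup": includeUnassigned})
filtered = filtered[filtered["inGroup"].astype(bool)].drop(columns=["inGroup"])
print(filtered)  # only rater "A"'s rating survives the group filter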
@@ -155,6 +226,7 @@ def _score_notes_and_users( noteStatusHistory: pd.DataFrame, prescoringNoteModelOutput: pd.DataFrame, prescoringRaterModelOutput: pd.DataFrame, + prescoringMetaScorerOutput: c.PrescoringMetaScorerOutput, ) -> Tuple[pd.DataFrame, pd.DataFrame]: """Run the matrix factorization scoring algorithm. @@ -174,24 +246,40 @@ def _score_notes_and_users( userScores pd.DataFrame: one row per user containing a column for each helpfulness score. """ - def prescore(self, scoringArgs: PrescoringArgs) -> ModelResult: + def prescore(self, scoringArgs: PrescoringArgs, preserveRatings: bool = True) -> ModelResult: """ Runs initial rounds of the matrix factorization scoring algorithm and returns intermediate output that can be used to initialize and reduce the runtime of final scoring. """ torch.set_num_threads(self._threads) - print( + logger.info( f"prescore: Torch intra-op parallelism for {self.get_name()} set to: {torch.get_num_threads()}" ) - # Transform input, run core scoring algorithm, transform output. with self.time_block("Filter input"): ratings, noteStatusHistory = self._filter_input( scoringArgs.noteTopics, - scoringArgs.ratings, + keep_columns( + scoringArgs.ratings, + [ + c.noteIdKey, + c.raterParticipantIdKey, + c.helpfulNumKey, + c.helpfulnessLevelKey, + c.createdAtMillisKey, + ] + + c.notHelpfulTagsTSVOrder + + c.helpfulTagsTSVOrder, + ), scoringArgs.noteStatusHistory, scoringArgs.userEnrollment, ) + if not preserveRatings: + # Only remove ratings if we're running in parallel, since otherwise later scorers will + # need the ratings. + del scoringArgs.ratings + gc.collect() + # If there are no ratings left after filtering, then return empty dataframes. if len(ratings) == 0: return ModelResult( @@ -207,12 +295,17 @@ def prescore(self, scoringArgs: PrescoringArgs) -> ModelResult: else None ), self.get_name(), + None, ) - noteScores, userScores = self._prescore_notes_and_users( + noteScores, userScores, metaScores = self._prescore_notes_and_users( ratings, noteStatusHistory, scoringArgs.userEnrollment ) + # Returning should remove references to ratings, but manually trigger GC just to reclaim + # resources as soon as possible. + del ratings + gc.collect() # Return dataframes with specified columns in specified order # Reindex fills required columns with NaN if they aren't present in the original df. return ModelResult( @@ -226,6 +319,7 @@ def prescore(self, scoringArgs: PrescoringArgs) -> ModelResult: columns=self.get_auxiliary_note_info_cols(), fill_value=np.nan ), scorerName=self.get_name(), + metaScores=metaScores, ) def _return_empty_final_scores(self) -> ModelResult: @@ -242,6 +336,7 @@ def _return_empty_final_scores(self) -> ModelResult: else None ), scorerName=self.get_name(), + metaScores=None, ) def score_final(self, scoringArgs: FinalScoringArgs) -> ModelResult: @@ -254,7 +349,7 @@ def score_final(self, scoringArgs: FinalScoringArgs) -> ModelResult: c.scorerNameKey field of those dataframes. 
""" torch.set_num_threads(self._threads) - print( + logger.info( f"score_final: Torch intra-op parallelism for {self.get_name()} set to: {torch.get_num_threads()}" ) @@ -263,12 +358,20 @@ def score_final(self, scoringArgs: FinalScoringArgs) -> ModelResult: prescoringNoteModelOutput = scoringArgs.prescoringNoteModelOutput[ scoringArgs.prescoringNoteModelOutput[c.scorerNameKey] == self.get_name() ].drop(columns=c.scorerNameKey, inplace=False) + if scoringArgs.prescoringRaterModelOutput is None: return self._return_empty_final_scores() prescoringRaterModelOutput = scoringArgs.prescoringRaterModelOutput[ scoringArgs.prescoringRaterModelOutput[c.scorerNameKey] == self.get_name() ].drop(columns=c.scorerNameKey, inplace=False) + if self.get_name() not in scoringArgs.prescoringMetaOutput.metaScorerOutput: + logger.info( + f"Scorer {self.get_name()} not found in prescoringMetaOutput; returning empty scores from final scoring." + ) + return self._return_empty_final_scores() + prescoringMetaScorerOutput = scoringArgs.prescoringMetaOutput.metaScorerOutput[self.get_name()] + # Filter raw input with self.time_block("Filter input"): ratings, noteStatusHistory = self._filter_input( @@ -279,14 +382,22 @@ def score_final(self, scoringArgs: FinalScoringArgs) -> ModelResult: ) # If there are no ratings left after filtering, then return empty dataframes. if len(ratings) == 0: + logger.info( + f"No rating left after filtering for Scorer {self.get_name()}, returning empty dataframes." + ) return self._return_empty_final_scores() - noteScores, userScores = self._score_notes_and_users( - ratings=ratings, - noteStatusHistory=noteStatusHistory, - prescoringNoteModelOutput=prescoringNoteModelOutput, - prescoringRaterModelOutput=prescoringRaterModelOutput, - ) + try: + noteScores, userScores = self._score_notes_and_users( + ratings=ratings, + noteStatusHistory=noteStatusHistory, + prescoringNoteModelOutput=prescoringNoteModelOutput, + prescoringRaterModelOutput=prescoringRaterModelOutput, + prescoringMetaScorerOutput=prescoringMetaScorerOutput, + ) + except EmptyRatingException: + logger.info(f"EmptyRatingException for Scorer {self.get_name()}") + return self._return_empty_final_scores() with self.time_block("Postprocess output"): # Only some subclasses do any postprocessing. @@ -325,6 +436,7 @@ def score_final(self, scoringArgs: FinalScoringArgs) -> ModelResult: if self.get_auxiliary_note_info_cols() else None, scorerName=self.get_name(), + metaScores=None, ) def score( @@ -338,7 +450,7 @@ def score( This function is deprecated and only included for testing purposes for now. Not intended to be called in main code flow (since the scorer will be split, and this function calls both phases sequentially) """ - print( + logger.info( "CALLED DEPRECATED scorer.score() function. Prefer sequentially calling prescore() then score_final()." 
) @@ -355,6 +467,14 @@ def score( prescoringModelResult.scoredNotes[c.scorerNameKey] = prescoringModelResult.scorerName if prescoringModelResult.helpfulnessScores is not None: prescoringModelResult.helpfulnessScores[c.scorerNameKey] = prescoringModelResult.scorerName + if ( + prescoringModelResult.metaScores is not None and prescoringModelResult.scorerName is not None + ): + prescoringMetaOutput = c.PrescoringMetaOutput( + metaScorerOutput={prescoringModelResult.scorerName: prescoringModelResult.metaScores} + ) + else: + prescoringMetaOutput = c.PrescoringMetaOutput(metaScorerOutput={}) finalScoringArgs = FinalScoringArgs( noteTopics=noteTopics, @@ -363,6 +483,7 @@ def score( userEnrollment=userEnrollment, prescoringNoteModelOutput=prescoringModelResult.scoredNotes, prescoringRaterModelOutput=prescoringModelResult.helpfulnessScores, + prescoringMetaOutput=prescoringMetaOutput, ) finalModelResult = self.score_final(finalScoringArgs) return ( diff --git a/sourcecode/scoring/scoring_rules.py b/sourcecode/scoring/scoring_rules.py index bf55a411..eeb52a47 100644 --- a/sourcecode/scoring/scoring_rules.py +++ b/sourcecode/scoring/scoring_rules.py @@ -1,16 +1,20 @@ from abc import ABC, abstractmethod from collections import namedtuple from enum import Enum -from typing import Callable, List, Optional, Set, Tuple +import logging +from typing import Any, Callable, Dict, List, Optional, Set, Tuple -from . import constants as c, tag_filter +from . import constants as c from .enums import Topics -from .explanation_tags import top_tags +from .explanation_tags import get_top_two_tags_for_note import numpy as np import pandas as pd +logger = logging.getLogger("birdwatch.scoring_rules") +logger.setLevel(logging.INFO) + RuleAndVersion = namedtuple("RuleAndVersion", ["ruleName", "ruleVersion", "lockingEnabled"]) """namedtuple identifying ScoringRule with a name and tracking revisions with a version.""" @@ -31,6 +35,7 @@ class RuleID(Enum): INCORRECT_OUTLIER = RuleAndVersion("FilterIncorrect", "1.0", False) LOW_DILIGENCE = RuleAndVersion("FilterLowDiligence", "1.0", False) LARGE_FACTOR = RuleAndVersion("FilterLargeFactor", "1.0", False) + LOW_INTERCEPT = RuleAndVersion("RejectLowIntercept", "1.0", False) # Rules used in _meta_score. 
META_INITIAL_NMR = RuleAndVersion("MetaInitialNMR", "1.0", False) @@ -52,11 +57,13 @@ class RuleID(Enum): GROUP_MODEL_12 = RuleAndVersion("GroupModel12", "1.1", False) GROUP_MODEL_13 = RuleAndVersion("GroupModel13", "1.1", True) GROUP_MODEL_14 = RuleAndVersion("GroupModel14", "1.1", True) - INSUFFICIENT_EXPLANATION = RuleAndVersion("InsufficientExplanation", "1.0", True) - SCORING_DRIFT_GUARD = RuleAndVersion("ScoringDriftGuard", "1.0", False) TOPIC_MODEL_1 = RuleAndVersion("TopicModel01", "1.0", False) TOPIC_MODEL_2 = RuleAndVersion("TopicModel02", "1.0", False) TOPIC_MODEL_3 = RuleAndVersion("TopicModel03", "1.0", False) + MULTI_GROUP_MODEL_1 = RuleAndVersion("MultiGroupModel01", "1.0", False) + INSUFFICIENT_EXPLANATION = RuleAndVersion("InsufficientExplanation", "1.0", True) + SCORING_DRIFT_GUARD = RuleAndVersion("ScoringDriftGuard", "1.0", False) + NMR_DUE_TO_MIN_STABLE_CRH_TIME = RuleAndVersion("NmrDueToMinStableCrhTime", "1.0", False) def get_name(self) -> str: """Returns a string combining the name and version to uniquely name the logic of the ScoringRule.""" @@ -178,6 +185,8 @@ def __init__( ruleID: RuleID, dependencies: Set[RuleID], sourceColumn: str, + checkFirmReject: bool = False, + filterColumnPairs: List[Tuple[str, Any]] = [], ): """Propagate the note status from sourceColumn when the status is not NaN. @@ -188,20 +197,42 @@ def __init__( """ super().__init__(ruleID, dependencies) self._sourceColumn = sourceColumn + self._checkFirmReject = checkFirmReject + self._filterColumnPairs = filterColumnPairs def score_notes( self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str ) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]: """Propagates any status set in sourceColumn when it is non-NaN.""" - notesWithStatus = ~noteStats[self._sourceColumn].isna() + # If necessary, prune noteStats according to prior firm rejects + if self._checkFirmReject: + coreRejects = noteStats[c.coreRatingStatusKey].isin( + {c.firmReject, c.currentlyRatedNotHelpful} + ) + expansionRejects = noteStats[c.expansionRatingStatusKey].isin( + {c.firmReject, c.currentlyRatedNotHelpful} + ) + crhBlocked = coreRejects | (noteStats[c.coreRatingStatusKey].isna() & expansionRejects) + crhNotes = noteStats[self._sourceColumn] == c.currentlyRatedHelpful + noteStats = noteStats[~(crhBlocked & crhNotes)] + # If necessary, prune noteStatus based on filter column pairs + if self._filterColumnPairs: + for col, value in self._filterColumnPairs: + noteStats = noteStats[noteStats[col] == value] + # Generate the set of note status updates + statusUpdateRows = ~noteStats[self._sourceColumn].isna() + noteStatusUpdates = noteStats[statusUpdateRows][[c.noteIdKey, self._sourceColumn]].rename( + columns={self._sourceColumn: statusColumn} + ) + # Rename FIRM_REJECT to NEEDS_MORE_RATINGS since the status will be exported as the final status + noteStatusUpdates.loc[ + noteStatusUpdates[statusColumn] == c.firmReject, statusColumn + ] = c.needsMoreRatings assert ( - noteStats.loc[notesWithStatus, self._sourceColumn] + noteStatusUpdates[statusColumn] .isin({c.currentlyRatedHelpful, c.currentlyRatedNotHelpful, c.needsMoreRatings}) .all() ), "status must be set to CRH, CRNH or NMR" - noteStatusUpdates = noteStats[notesWithStatus][[c.noteIdKey, self._sourceColumn]].rename( - columns={self._sourceColumn: statusColumn} - ) return (noteStatusUpdates, None) @@ -211,9 +242,8 @@ def __init__( ruleID: RuleID, dependencies: Set[RuleID], status: str, - crhSuperThreshold: float, + tagFilterThresholds: Dict[str, float], 
minAdjustedTotal: float = 2.5, - tagRatioPercentile: int = 95, ): """Filter CRH notes for outliers with high levels of any particular tag. @@ -221,60 +251,61 @@ def __init__( rule: enum corresponding to a namedtuple defining a rule name and version string for the ScoringRule. dependencies: Rules which must run before this rule can run. status: the status which each note should be set to (e.g. CRH, CRNH, NMR) - crhSuperThreshold: If the note intercept exceeds the crhSuperThreshold, then the - tag filter is disabled. - tagRatioPercentile: For a filter to trigger, the adjusted ratio value for a - tag must exceed Nth percentile for notes currently rated as CRH. minAdjustedTotal: For a filter to trigger, the adjusted total of a tag must exceed the minAdjustedTotal. + tagFilterThresholds: For a filter to trigger, the adjusted ratio value for a + tag must exceed the given value (computed in prescoring as the Nth percentile + for notes currently rated as CRH). """ super().__init__(ruleID, dependencies) self._status = status - self._tagRatioPercentile = tagRatioPercentile self._minAdjustedTotal = minAdjustedTotal - self._crhSuperThreshold = crhSuperThreshold + self._tagFilterThresholds = tagFilterThresholds def score_notes( self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str ) -> Tuple[pd.DataFrame, pd.DataFrame]: """Returns notes on track for CRH with high levels of any tag to receive NMR status.""" - # Prune noteStats to only include CRH notes. - crhNotes = currentLabels[currentLabels[statusColumn] == c.currentlyRatedHelpful][[c.noteIdKey]] - crhStats = noteStats.merge(crhNotes, on=c.noteIdKey, how="inner") - print(f"CRH notes prior to tag filtering: {len(crhStats)}") - print( - f"CRH notes above crhSuperThreshold: {sum(crhStats[c.internalNoteInterceptKey] > self._crhSuperThreshold)}" - ) + # Prune noteStats to exclude CRNH notes. CRNH will have stronger downstream effects, so + # we don't want to over-write that status. + candidateNotes = currentLabels[currentLabels[statusColumn] != c.currentlyRatedNotHelpful][ + [c.noteIdKey] + ] + noteStats = noteStats.merge(candidateNotes, on=c.noteIdKey, how="inner") + logger.info(f"Candidate notes prior to tag filtering: {len(noteStats)}") + # Identify impacted notes. - thresholds = tag_filter.get_tag_thresholds(crhStats, self._tagRatioPercentile) - impactedNotes = pd.DataFrame.from_dict({c.noteIdKey: [], c.activeFilterTagsKey: []}).astype( - {c.noteIdKey: np.int64} + impactedNotes = pd.DataFrame.from_dict( + { + c.noteIdKey: pd.Series([], dtype=np.int64), + c.activeFilterTagsKey: pd.Series([], dtype=object), + } ) - print("Checking note tags:") + logger.info("Checking note tags:") for tag in c.notHelpfulTagsTSVOrder: adjustedColumn = f"{tag}{c.adjustedSuffix}" adjustedRatioColumn = f"{adjustedColumn}{c.ratioSuffix}" - print(tag) - print(f" ratio threshold: {thresholds[adjustedRatioColumn]}") + logger.info(tag) if tag == c.notHelpfulHardToUnderstandKey: - print(f"outlier filtering disabled for tag: {tag}") + logger.info(f"outlier filtering disabled for tag: {tag}") continue - tagFilteredNotes = crhStats[ + tagFilteredNotes = noteStats[ # Adjusted total must pass minimum threhsold set across all tags. - (crhStats[adjustedColumn] > self._minAdjustedTotal) + (noteStats[adjustedColumn] > self._minAdjustedTotal) # Adjusted ratio must exceed percentile based total for this specific tag. 
- & (crhStats[adjustedRatioColumn] > thresholds[adjustedRatioColumn]) + & (noteStats[adjustedRatioColumn] > self._tagFilterThresholds[adjustedRatioColumn]) ][c.noteIdKey] impactedNotes = pd.concat( - [impactedNotes, pd.DataFrame({c.noteIdKey: tagFilteredNotes, c.activeFilterTagsKey: tag})] + [impactedNotes, pd.DataFrame({c.noteIdKey: tagFilteredNotes, c.activeFilterTagsKey: tag})], + unsafeAllowed=[c.defaultIndexKey, c.activeFilterTagsKey], ) # log and consolidate imapcted notes - print(f"Total {{note, tag}} pairs where tag filter logic triggered: {len(impactedNotes)}") + logger.info(f"Total {{note, tag}} pairs where tag filter logic triggered: {len(impactedNotes)}") impactedNotes = impactedNotes.groupby(c.noteIdKey).aggregate(list).reset_index() impactedNotes[c.activeFilterTagsKey] = [ ",".join(tags) for tags in impactedNotes[c.activeFilterTagsKey] ] - print(f"Total unique notes impacted by tag filtering: {len(impactedNotes)}") + logger.info(f"Total unique notes impacted by tag filtering: {len(impactedNotes)}") noteStatusUpdates = impactedNotes[[c.noteIdKey]].drop_duplicates() noteStatusUpdates[statusColumn] = self._status return (noteStatusUpdates, impactedNotes) @@ -289,7 +320,6 @@ def __init__( tagThreshold: int, voteThreshold: int, weightedTotalVotes: float, - superThreshold: Optional[float], ): """Filter CRH notes for outliers with high levels of incorrect tag from similar factor raters. @@ -301,39 +331,34 @@ def __init__( voteThreshold: threshold for number of included raters (raters must have issued a NH tag to be inclueed) weightedTotalVotes: For the filter to trigger, the sum of weighted incorrect votes must exceed the minAdjustedTotal. - superThreshold: if set, allow notes with an intercept above threshold to bypass the filter. - colSuffix: string suffix to apply to lookup columns """ super().__init__(ruleID, dependencies) self._status = status self._tagThreshold = tagThreshold self._voteThreshold = voteThreshold self._weightedTotalVotes = weightedTotalVotes - self._superThreshold = superThreshold def score_notes( self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str ) -> Tuple[pd.DataFrame, pd.DataFrame]: """Returns notes on track for CRH with high levels of any tag to receive NMR status.""" - # Prune noteStats to only include CRH notes. - crhNotes = currentLabels[currentLabels[statusColumn] == c.currentlyRatedHelpful][[c.noteIdKey]] - crhStats = noteStats.merge(crhNotes, on=c.noteIdKey, how="inner") + # Prune noteStats to exclude CRNH notes. CRNH will have stronger downstream effects, so + # we don't want to over-write that status. + candidateNotes = currentLabels[currentLabels[statusColumn] != c.currentlyRatedNotHelpful][ + [c.noteIdKey] + ] + noteStats = noteStats.merge(candidateNotes, on=c.noteIdKey, how="inner") # Identify impacted notes. 
- noteStatusUpdates = crhStats.loc[ - (crhStats["notHelpfulIncorrect_interval"] >= self._tagThreshold) - & (crhStats["num_voters_interval"] >= self._voteThreshold) - & (crhStats["tf_idf_incorrect_interval"] >= self._weightedTotalVotes) - & ( - True - if self._superThreshold is None - else crhStats[c.internalNoteInterceptKey] < self._superThreshold - ) + noteStatusUpdates = noteStats.loc[ + (noteStats["notHelpfulIncorrect_interval"] >= self._tagThreshold) + & (noteStats["num_voters_interval"] >= self._voteThreshold) + & (noteStats["tf_idf_incorrect_interval"] >= self._weightedTotalVotes) ][[c.noteIdKey]] pd.testing.assert_frame_equal(noteStatusUpdates, noteStatusUpdates.drop_duplicates()) - print(f"Total notes impacted by incorrect filtering: {len(noteStatusUpdates)}") + logger.info(f"Total notes impacted by incorrect filtering: {len(noteStatusUpdates)}") noteStatusUpdates[statusColumn] = self._status return (noteStatusUpdates, None) @@ -363,18 +388,21 @@ def score_notes( self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str ) -> Tuple[pd.DataFrame, pd.DataFrame]: """Returns notes on track for CRH with a high low diligence intercept.""" - # Prune noteStats to only include CRH notes. - crhNotes = currentLabels[currentLabels[statusColumn] == c.currentlyRatedHelpful][[c.noteIdKey]] - crhStats = noteStats.merge(crhNotes, on=c.noteIdKey, how="inner") + # Prune noteStats to exclude CRNH notes. CRNH will have stronger downstream effects, so + # we don't want to over-write that status. + candidateNotes = currentLabels[currentLabels[statusColumn] != c.currentlyRatedNotHelpful][ + [c.noteIdKey] + ] + noteStats = noteStats.merge(candidateNotes, on=c.noteIdKey, how="inner") # Identify impacted notes. - noteStatusUpdates = crhStats.loc[ - crhStats[c.lowDiligenceInterceptKey] > self._interceptThreshold + noteStatusUpdates = noteStats.loc[ + noteStats[c.lowDiligenceNoteInterceptKey] > self._interceptThreshold ][[c.noteIdKey]] pd.testing.assert_frame_equal(noteStatusUpdates, noteStatusUpdates.drop_duplicates()) - print(f"Total notes impacted by low diligence filtering: {len(noteStatusUpdates)}") + logger.info(f"Total notes impacted by low diligence filtering: {len(noteStatusUpdates)}") noteStatusUpdates[statusColumn] = self._status return (noteStatusUpdates, None) @@ -404,23 +432,176 @@ def score_notes( self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str ) -> Tuple[pd.DataFrame, pd.DataFrame]: """Returns notes on track for CRH with a high low diligence intercept.""" - # Prune noteStats to only include CRH notes. - crhNotes = currentLabels[currentLabels[statusColumn] == c.currentlyRatedHelpful][[c.noteIdKey]] - crhStats = noteStats.merge(crhNotes, on=c.noteIdKey, how="inner") + # Prune noteStats to exclude CRNH notes. CRNH will have stronger downstream effects, so + # we don't want to over-write that status. + candidateNotes = currentLabels[currentLabels[statusColumn] == c.currentlyRatedHelpful][ + [c.noteIdKey] + ] + noteStats = noteStats.merge(candidateNotes, on=c.noteIdKey, how="inner") # Identify impacted notes. 
- noteStatusUpdates = crhStats.loc[ - crhStats[c.internalNoteFactor1Key].abs() > self._factorThreshold + noteStatusUpdates = noteStats.loc[ + noteStats[c.internalNoteFactor1Key].abs() > self._factorThreshold ][[c.noteIdKey]] pd.testing.assert_frame_equal(noteStatusUpdates, noteStatusUpdates.drop_duplicates()) - print(f"Total notes impacted by large factor filtering: {len(noteStatusUpdates)}") + logger.info(f"Total notes impacted by large factor filtering: {len(noteStatusUpdates)}") noteStatusUpdates[statusColumn] = self._status return (noteStatusUpdates, None) +class NmrDueToMinStableCrhTime(ScoringRule): + def __init__( + self, + ruleID: RuleID, + dependencies: Set[RuleID], + requiredStableCrhMinutesThreshold: int = 30, + ): + """Make CRH notes NMR if it hasn't been stably CRH >= requiredStableCrhMinutesThreshold. + + Args: + rule: enum corresponding to a namedtuple defining a rule name and version string for the + ScoringRule. + dependencies: Rules which must run before this rule can run. + requiredStableCrhMinutesThreshold: threshold for required stable CRH time, in minutes. + """ + super().__init__(ruleID, dependencies) + self.requiredStableCrhMinutesThreshold = requiredStableCrhMinutesThreshold + + def score_notes( + self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str + ) -> Tuple[pd.DataFrame, pd.DataFrame]: + # Prune noteStats to exclude CRH notes (CRHed before current scoring run). + noteStats = noteStats[noteStats[c.currentLabelKey] != c.currentlyRatedHelpful] + noteStats = noteStats.merge(currentLabels, on=c.noteIdKey, how="inner") + + # Identify impacted notes: + # (1) CRH from current run + # (A) If timestampMillisOfNmrDueToMinStableCrhTime doesn't exist: + # Set status to NMR, set timestampMillisOfNmrDueToMinStableCrhTime to now. + # (B) Otherwise: + # (a) If it has been long enough since timestampMillisOfNmrDueToMinStableCrhTime, + # set status to CRH, clear timestampMillisOfNmrDueToMinStableCrhTime + # (b) Otherwise, set status to NMR. + # (2) Non-CRH from current run and timestampMillisOfNmrDueToMinStableCrhTime exists. + # Clear timestampMillisOfNmrDueToMinStableCrhTime. 
+ noteStatusUpdates = noteStats.loc[ + (noteStats[statusColumn] == c.currentlyRatedHelpful) + | ( + ~noteStats[c.timestampMillisOfNmrDueToMinStableCrhTimeKey].isna() + & (noteStats[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] > 0) + ) + ][[c.noteIdKey, c.timestampMillisOfNmrDueToMinStableCrhTimeKey, statusColumn]] + + pd.testing.assert_frame_equal(noteStatusUpdates, noteStatusUpdates.drop_duplicates()) + + newStatusColumn = statusColumn + "_new" + noteStatusUpdates[newStatusColumn] = np.nan + noteStatusUpdates[c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey] = noteStatusUpdates[ + c.timestampMillisOfNmrDueToMinStableCrhTimeKey + ] + # (1)-(A) + noteStatusUpdates.loc[ + (noteStatusUpdates[statusColumn] == c.currentlyRatedHelpful) + & ( + noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey].isna() + | (noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] <= 0) + ), + [newStatusColumn, c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey], + ] = [c.needsMoreRatings, c.epochMillis] + # (1)-(B)-(a) + noteStatusUpdates.loc[ + (noteStatusUpdates[statusColumn] == c.currentlyRatedHelpful) + & ( + ~noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey].isna() + & (noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] > 0) + ) + & ( + c.epochMillis - noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] + >= self.requiredStableCrhMinutesThreshold * 60 * 1000 + ), + [newStatusColumn, c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey], + ] = [c.currentlyRatedHelpful, -1] + # (1)-(B)-(b) + noteStatusUpdates.loc[ + (noteStatusUpdates[statusColumn] == c.currentlyRatedHelpful) + & ( + ~noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey].isna() + & (noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] > 0) + ) + & ( + c.epochMillis - noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] + < self.requiredStableCrhMinutesThreshold * 60 * 1000 + ), + newStatusColumn, + ] = c.needsMoreRatings + # (2) + noteStatusUpdates.loc[ + (noteStatusUpdates[statusColumn] != c.currentlyRatedHelpful) + & ( + ~noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey].isna() + & (noteStatusUpdates[c.timestampMillisOfNmrDueToMinStableCrhTimeKey] > 0) + ), + c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey, + ] = -1 + + noteStatusUpdatesWithStatusChange = noteStatusUpdates.loc[ + (noteStatusUpdates[statusColumn] == c.currentlyRatedHelpful) + & (noteStatusUpdates[newStatusColumn] == c.needsMoreRatings) + ][[c.noteIdKey, newStatusColumn]] + noteStatusUpdatesWithStatusChange.rename(columns={newStatusColumn: statusColumn}, inplace=True) + + logger.info( + f"Total notes impacted (CRH->NMR) by NmrDueToMinStableCrhTime: " + f"{len(noteStatusUpdatesWithStatusChange)}" + ) + + return ( + noteStatusUpdatesWithStatusChange, + noteStatusUpdates[[c.noteIdKey, c.updatedTimestampMillisOfNmrDueToMinStableCrhTimeKey]], + ) + + +class RejectLowIntercept(ScoringRule): + def __init__( + self, + ruleID: RuleID, + dependencies: Set[RuleID], + status: str, + firmRejectThreshold: float, + ): + """Set notes with an intercept below firmRejectThreshold to firmReject, preventing downstream CRH. + + Args: + rule: enum corresponding to a namedtuple defining a rule name and version string for the ScoringRule. + dependencies: Rules which must run before this rule can run. + status: the status which each note should be set to (e.g. 
CRH, CRNH, NMR) + firmRejectThreshold: firmReject notes with an intercept below this threshold + """ + super().__init__(ruleID, dependencies) + self._status = status + self._firmRejectThreshold = firmRejectThreshold + + def score_notes( + self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str + ) -> Tuple[pd.DataFrame, pd.DataFrame]: + """Returns notes on track for NMR with an intercept below firmRejectThreshold.""" + # Require that notes are currently NMR. If the note is already on track for firmReject, no need + # to update the status since a more specific rule has already acted on the note. If the note is + # on track for CRNH, leave status unchanged so the finalRatingStatus is CRNH. + candidateNotes = currentLabels[currentLabels[statusColumn] != c.currentlyRatedNotHelpful][ + [c.noteIdKey] + ] + noteStats = noteStats.merge(candidateNotes, on=c.noteIdKey, how="inner") + noteStatusUpdates = noteStats.loc[ + (noteStats[c.internalNoteInterceptKey] < self._firmRejectThreshold) + ][[c.noteIdKey]] + noteStatusUpdates[statusColumn] = self._status + return (noteStatusUpdates, None) + + class ApplyGroupModelResult(ScoringRule): def __init__( self, @@ -429,7 +610,7 @@ def __init__( groupNumber: int, coreCrhThreshold: Optional[float], expansionCrhThreshold: Optional[float], - minSafeguardThreshold: Optional[float] = 0.3, + minSafeguardThreshold: float = 0.3, ): """Set CRH status based on a modeling group result. @@ -457,18 +638,20 @@ def __init__( self._minSafeguardThreshold = minSafeguardThreshold self._coreCrhThreshold = coreCrhThreshold self._expansionCrhThreshold = expansionCrhThreshold - if self._minSafeguardThreshold is None: - assert self._coreCrhThreshold is None - assert self._expansionCrhThreshold is None - else: - assert self._coreCrhThreshold is not None - assert self._expansionCrhThreshold is not None + assert self._minSafeguardThreshold is not None def score_notes( self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str ) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]: """Flip notes from NMR to CRH based on group models and subject to core/expansion model safeguards.""" - # Identify notes which were CRH from the applicable group model. + # Identify notes blocked from CRH status due to FR/CRNH status in core or expansion + coreRejects = noteStats[c.coreRatingStatusKey].isin({c.firmReject, c.currentlyRatedNotHelpful}) + expansionRejects = noteStats[c.expansionRatingStatusKey].isin( + {c.firmReject, c.currentlyRatedNotHelpful} + ) + blocked = coreRejects | (noteStats[c.coreRatingStatusKey].isna() & expansionRejects) + noteStats = noteStats[~blocked] + # Generate the set of note status updates probationaryCRHNotes = noteStats[ (noteStats[c.groupRatingStatusKey] == c.currentlyRatedHelpful) & (noteStats[c.modelingGroupKey] == self._groupNumber) @@ -480,40 +663,42 @@ def score_notes( # Identify candidate note status updates noteStatusUpdates = probationaryCRHNotes.merge(currentNMRNotes, on=c.noteIdKey, how="inner") # If necessary, identify notes which pass score bound checks for expansion and core models. 
- if self._minSafeguardThreshold is not None: - # Apply min and max threhsolds to core and expansion intercepts - noteStats = noteStats[ - [c.noteIdKey, c.coreNoteInterceptKey, c.expansionNoteInterceptKey] - ].copy() - noteStats["core"] = (noteStats[c.coreNoteInterceptKey] < self._coreCrhThreshold) & ( - noteStats[c.coreNoteInterceptKey] > self._minSafeguardThreshold + # Apply min and max threhsolds to core and expansion intercepts + noteStats = noteStats[[c.noteIdKey, c.coreNoteInterceptKey, c.expansionNoteInterceptKey]].copy() + noteStats["core"] = noteStats[c.coreNoteInterceptKey] > self._minSafeguardThreshold + if self._coreCrhThreshold is not None: + noteStats["core"] = noteStats["core"] & ( + noteStats[c.coreNoteInterceptKey] < self._coreCrhThreshold ) - noteStats.loc[noteStats[c.coreNoteInterceptKey].isna(), "core"] = np.nan - noteStats["expansion"] = ( + noteStats.loc[noteStats[c.coreNoteInterceptKey].isna(), "core"] = np.nan + noteStats["expansion"] = noteStats[c.expansionNoteInterceptKey] > self._minSafeguardThreshold + if self._expansionCrhThreshold is not None: + noteStats["expansion"] = noteStats["expansion"] & ( noteStats[c.expansionNoteInterceptKey] < self._expansionCrhThreshold - ) & (noteStats[c.expansionNoteInterceptKey] > self._minSafeguardThreshold) - noteStats.loc[noteStats[c.expansionNoteInterceptKey].isna(), "expansion"] = np.nan - - # Prioritize core over expansion intercepts when available - def _get_value(row): - idx = row.first_valid_index() - # If either core or expansion had an intercept then return whether it was in the valid - # range. If neither had an intercept, return False. Preference is given to core due - # to the ordering when selecting columns from noteStats below. - if idx is None: - return False - elif row[idx] == 1.0: - return True - elif row[idx] == 0.0: - return False - else: - assert False, f"unexpected value: {row[idx]}" - + ) + noteStats.loc[noteStats[c.expansionNoteInterceptKey].isna(), "expansion"] = np.nan + + # Prioritize core over expansion intercepts when available + def _get_value(row): + idx = row.first_valid_index() + # If either core or expansion had an intercept then return whether it was in the valid + # range. If neither had an intercept, return False. Preference is given to core due + # to the ordering when selecting columns from noteStats below. + if idx is None: + return False + elif row[idx] == 1.0: + return True + elif row[idx] == 0.0: + return False + else: + assert False, f"unexpected value: {row[idx]}" + + with c.time_block("Get value apply for group model"): noteStats["actionable"] = noteStats[["core", "expansion"]].apply(_get_value, axis=1) - # Filter set of note status updates to only include actionable notes - actionableNotes = noteStats[noteStats["actionable"]][[c.noteIdKey]] - noteStatusUpdates = noteStatusUpdates.merge(actionableNotes, on=c.noteIdKey, how="inner") + # Filter set of note status updates to only include actionable notes + actionableNotes = noteStats[noteStats["actionable"]][[c.noteIdKey]] + noteStatusUpdates = noteStatusUpdates.merge(actionableNotes, on=c.noteIdKey, how="inner") # Set note status and return noteStatusUpdates[statusColumn] = c.currentlyRatedHelpful @@ -540,7 +725,6 @@ def __init__( status: the status which each note should be set to (e.g. CRH, CRNH, NMR) minRatingsToGetTag: min number occurrences to assign a tag to a note. minTagsNeededForStatus: min tags assigned before a note can be CRH/CRNH - tagsConsidered: set of tags to consider for *all* notes. 
""" super().__init__(ruleID, dependencies) self._status = status @@ -551,61 +735,75 @@ def __init__( def score_notes( self, noteStats: pd.DataFrame, currentLabels: pd.DataFrame, statusColumn: str ) -> Tuple[pd.DataFrame, pd.DataFrame]: - """Sets Top Tags, returns notes on track for CRH / CRNH with insufficient to receive NMR status.""" + """Sets Top Tags inplace on noteStats, + returns notes on track for CRH / CRNH with insufficient to receive NMR status.""" + noteStats[c.firstTagKey] = noteStats[c.firstTagKey].astype(object) + noteStats[c.secondTagKey] = noteStats[c.secondTagKey].astype(object) if self._tagsConsidered is None: - # Set Top Tags + # Set Top CRH Tags crh_idx = noteStats[c.noteIdKey].isin( currentLabels.loc[currentLabels[statusColumn] == c.currentlyRatedHelpful, c.noteIdKey] ) - noteStats.loc[crh_idx, :] = noteStats.loc[crh_idx, :].apply( - lambda row: top_tags( - row, self._minRatingsToGetTag, self._minTagsNeededForStatus, c.helpfulTagsTiebreakOrder - ), - axis=1, + topCrhTags = get_top_two_tags_for_note( + noteStats.loc[crh_idx, :], + self._minTagsNeededForStatus, + self._minRatingsToGetTag, + c.helpfulTagsTiebreakOrder, ) + noteStats.set_index(c.noteIdKey, inplace=True) + noteStats.loc[topCrhTags[c.noteIdKey], c.firstTagKey] = topCrhTags[c.firstTagKey] + noteStats.loc[topCrhTags[c.noteIdKey], c.secondTagKey] = topCrhTags[c.secondTagKey] + noteStats.reset_index(inplace=True) + + # Set Top CRNH Tags crnh_idx = noteStats[c.noteIdKey].isin( currentLabels.loc[currentLabels[statusColumn] == c.currentlyRatedNotHelpful, c.noteIdKey] ) - noteStats.loc[crnh_idx, :] = noteStats.loc[crnh_idx, :].apply( - lambda row: top_tags( - row, self._minRatingsToGetTag, self._minTagsNeededForStatus, c.notHelpfulTagsTiebreakOrder - ), - axis=1, + topCrnhTags = get_top_two_tags_for_note( + noteStats.loc[crnh_idx, :], + self._minRatingsToGetTag, + self._minTagsNeededForStatus, + c.notHelpfulTagsTiebreakOrder, ) + noteStats.set_index(c.noteIdKey, inplace=True) + noteStats.loc[topCrnhTags[c.noteIdKey], c.firstTagKey] = topCrnhTags[c.firstTagKey] + noteStats.loc[topCrnhTags[c.noteIdKey], c.secondTagKey] = topCrnhTags[c.secondTagKey] + noteStats.reset_index(inplace=True) else: - noteStats = noteStats.apply( - lambda row: top_tags( - row, self._minRatingsToGetTag, self._minTagsNeededForStatus, self._tagsConsidered - ), - axis=1, + topTags = get_top_two_tags_for_note( + noteStats, + self._minRatingsToGetTag, + self._minTagsNeededForStatus, + self._tagsConsidered, + ) + noteStats.loc[:, c.firstTagKey] = topTags[c.firstTagKey] + noteStats.loc[:, c.secondTagKey] = topTags[c.secondTagKey] + + noteStats[c.firstTagKey] = noteStats[c.firstTagKey].astype(object) + noteStats[c.secondTagKey] = noteStats[c.secondTagKey].astype(object) + + with c.time_block("Insufficient explanation: post-processing"): + # Prune noteStats to only include CRH / CRNH notes. + crNotes = currentLabels[ + (currentLabels[statusColumn] == c.currentlyRatedHelpful) + | (currentLabels[statusColumn] == c.currentlyRatedNotHelpful) + ][[c.noteIdKey]] + crStats = noteStats.merge(crNotes, on=c.noteIdKey, how="inner") + logger.info( + f"CRH / CRNH notes prior to filtering for insufficient explanation: {len(crStats)}" ) - # For unclear reasons, the "apply" above converts the noteId column to a float. This cast - # guarantees that the type of the noteId column remains int64. Note that the cast will fail - # if the noteId column includes nan values. 
- # - # See links below for more context: - # https://stackoverflow.com/questions/40251948/stop-pandas-from-converting-int-to-float-due-to-an-insertion-in-another-column - # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.convert_dtypes.html - noteStats[c.noteIdKey] = noteStats[c.noteIdKey].astype(np.int64) - - # Prune noteStats to only include CRH / CRNH notes. - crNotes = currentLabels[ - (currentLabels[statusColumn] == c.currentlyRatedHelpful) - | (currentLabels[statusColumn] == c.currentlyRatedNotHelpful) - ][[c.noteIdKey]] - crStats = noteStats.merge(crNotes, on=c.noteIdKey, how="inner") - print(f"CRH / CRNH notes prior to filtering for insufficient explanation: {len(crStats)}") - # Identify impacted notes. - noteStatusUpdates = crStats.loc[ - (~crStats[[c.firstTagKey, c.secondTagKey]].isna()).sum(axis=1) < self._minTagsNeededForStatus - ][[c.noteIdKey]] + # Identify impacted notes. + noteStatusUpdates = crStats.loc[ + (~crStats[[c.firstTagKey, c.secondTagKey]].isna()).sum(axis=1) + < self._minTagsNeededForStatus + ][[c.noteIdKey]] - pd.testing.assert_frame_equal(noteStatusUpdates, noteStatusUpdates.drop_duplicates()) + pd.testing.assert_frame_equal(noteStatusUpdates, noteStatusUpdates.drop_duplicates()) - print(f"Total notes impacted by explanation filtering: {len(noteStatusUpdates)}") - noteStatusUpdates[statusColumn] = self._status + logger.info(f"Total notes impacted by explanation filtering: {len(noteStatusUpdates)}") + noteStatusUpdates[statusColumn] = self._status return (noteStatusUpdates, None) @@ -811,64 +1009,83 @@ def apply_scoring_rules( """ # Initialize empty dataframes to store labels for each note and which rules impacted # scoring for each note. - noteLabels = pd.DataFrame.from_dict({c.noteIdKey: [], statusColumn: []}).astype( - {c.noteIdKey: np.int64} + noteLabels = pd.DataFrame.from_dict( + {c.noteIdKey: pd.Series([], dtype=np.int64), statusColumn: pd.Series([], dtype=object)} ) - noteRules = pd.DataFrame.from_dict({c.noteIdKey: [], ruleColumn: []}).astype( - {c.noteIdKey: np.int64} + noteRules = pd.DataFrame.from_dict( + {c.noteIdKey: pd.Series([], dtype=np.int64), ruleColumn: pd.Series([], dtype=object)} ) - noteColumns = pd.DataFrame.from_dict({c.noteIdKey: []}).astype({c.noteIdKey: np.int64}) + noteColumns = pd.DataFrame.from_dict({c.noteIdKey: pd.Series([], dtype=np.int64)}) # Establish state to enforce rule dependencies. ruleIDs: Set[RuleID] = set() # Successively apply each rule for rule in rules: - print(f"Applying scoring rule: {rule.get_name()}") - rule.check_dependencies(ruleIDs) - assert rule.get_rule_id() not in ruleIDs, f"repeat ruleID: {rule.get_name()}" - ruleIDs.add(rule.get_rule_id()) - noteStatusUpdates, additionalColumns = rule.score_notes(noteStats, noteLabels, statusColumn) - if additionalColumns is not None: - assert set(noteStatusUpdates[c.noteIdKey]) == set(additionalColumns[c.noteIdKey]) - # Update noteLabels, which will always hold at most one label per note. 
- noteLabels = pd.concat([noteLabels, noteStatusUpdates]).groupby(c.noteIdKey).tail(1) - # Update note rules to have one row per rule which was active for a note - noteRules = pd.concat( - [ - noteRules, - pd.DataFrame.from_dict( - {c.noteIdKey: noteStatusUpdates[c.noteIdKey], ruleColumn: rule.get_name()} - ), - ] + with c.time_block(f"Applying scoring rule: {rule.get_name()}"): + logger.info(f"Applying scoring rule: {rule.get_name()}") + rule.check_dependencies(ruleIDs) + assert rule.get_rule_id() not in ruleIDs, f"repeat ruleID: {rule.get_name()}" + ruleIDs.add(rule.get_rule_id()) + with c.time_block(f"Calling score_notes: {rule.get_name()}"): + noteStatusUpdates, additionalColumns = rule.score_notes(noteStats, noteLabels, statusColumn) + if ( + additionalColumns is not None + # This rule updates both status and NmrDueToStableCrhTime (in additional column), they can + # be on different rows. + and rule.get_rule_id() != RuleID.NMR_DUE_TO_MIN_STABLE_CRH_TIME + ): + assert set(noteStatusUpdates[c.noteIdKey]) == set(additionalColumns[c.noteIdKey]) + + # Update noteLabels, which will always hold at most one label per note. + unsafeAllowed = {c.internalRatingStatusKey, c.finalRatingStatusKey, c.defaultIndexKey} + noteLabels = ( + pd.concat([noteLabels, noteStatusUpdates], unsafeAllowed=unsafeAllowed) + .groupby(c.noteIdKey) + .tail(1) + ) + # Update note rules to have one row per rule which was active for a note + noteRules = pd.concat( + [ + noteRules, + pd.DataFrame.from_dict( + {c.noteIdKey: noteStatusUpdates[c.noteIdKey], ruleColumn: rule.get_name()} + ), + ], + unsafeAllowed={c.internalActiveRulesKey, c.defaultIndexKey, c.metaScorerActiveRulesKey}, + ) + if additionalColumns is not None: + # Merge any additional columns into current set of new columns + assert {c.noteIdKey} == (set(noteColumns.columns) & set(additionalColumns.columns)) + noteColumns = noteColumns.merge( + additionalColumns, on=c.noteIdKey, how="outer", unsafeAllowed=c.defaultIndexKey + ) + + with c.time_block("Condense noteRules after applying all scoring rules"): + # Having applied all scoring rules, condense noteRules to have one row per note representing + # all of the ScoringRuless which were active for the note. 
+ noteRules = noteRules.groupby(c.noteIdKey).aggregate(list).reset_index() + if decidedByColumn: + noteRules[decidedByColumn] = [rules[-1] for rules in noteRules[ruleColumn]] + noteRules[ruleColumn] = [",".join(activeRules) for activeRules in noteRules[ruleColumn]] + # Validate that there are labels and assigned rules for each note + assert set(noteStats[c.noteIdKey]) == set(noteLabels[c.noteIdKey]) + assert set(noteStats[c.noteIdKey]) == set(noteRules[c.noteIdKey]) + assert len(set(noteColumns[c.noteIdKey]) - set(noteStats[c.noteIdKey])) == 0 + # Merge note labels, active rules and new columns into noteStats to form scoredNotes + scoredNotes = noteStats.merge(noteLabels, on=c.noteIdKey, how="inner") + scoredNotes = scoredNotes.merge(noteRules, on=c.noteIdKey, how="inner") + scoredNotes = scoredNotes.merge(noteColumns, on=c.noteIdKey, how="left") + # Add all of the individual model rules to the active rules column + assert len(scoredNotes) == len(noteStats) + # Set boolean columns indicating scoring outcomes + scoredNotes[c.currentlyRatedHelpfulBoolKey] = ( + scoredNotes[statusColumn] == c.currentlyRatedHelpful ) - if additionalColumns is not None: - # Merge any additional columns into current set of new columns - assert {c.noteIdKey} == (set(noteColumns.columns) & set(additionalColumns.columns)) - noteColumns = noteColumns.merge(additionalColumns, on=c.noteIdKey, how="outer") - - # Having applied all scoring rules, condense noteRules to have one row per note representing - # all of the ScoringRuless which were active for the note. - noteRules = noteRules.groupby(c.noteIdKey).aggregate(list).reset_index() - if decidedByColumn: - noteRules[decidedByColumn] = [rules[-1] for rules in noteRules[ruleColumn]] - noteRules[ruleColumn] = [",".join(activeRules) for activeRules in noteRules[ruleColumn]] - # Validate that there are labels and assigned rules for each note - assert set(noteStats[c.noteIdKey]) == set(noteLabels[c.noteIdKey]) - assert set(noteStats[c.noteIdKey]) == set(noteRules[c.noteIdKey]) - assert len(set(noteColumns[c.noteIdKey]) - set(noteStats[c.noteIdKey])) == 0 - # Merge note labels, active rules and new columns into noteStats to form scoredNotes - scoredNotes = noteStats.merge(noteLabels, on=c.noteIdKey, how="inner") - scoredNotes = scoredNotes.merge(noteRules, on=c.noteIdKey, how="inner") - scoredNotes = scoredNotes.merge(noteColumns, on=c.noteIdKey, how="left") - # Add all of the individual model rules to the active rules column - assert len(scoredNotes) == len(noteStats) - # Set boolean columns indicating scoring outcomes - scoredNotes[c.currentlyRatedHelpfulBoolKey] = scoredNotes[statusColumn] == c.currentlyRatedHelpful - scoredNotes[c.currentlyRatedNotHelpfulBoolKey] = ( - scoredNotes[statusColumn] == c.currentlyRatedNotHelpful - ) - scoredNotes[c.awaitingMoreRatingsBoolKey] = scoredNotes[statusColumn] == c.needsMoreRatings + scoredNotes[c.currentlyRatedNotHelpfulBoolKey] = ( + scoredNotes[statusColumn] == c.currentlyRatedNotHelpful + ) + scoredNotes[c.awaitingMoreRatingsBoolKey] = scoredNotes[statusColumn] == c.needsMoreRatings # Return completed DF including original noteStats signals merged wtih scoring results return scoredNotes diff --git a/sourcecode/scoring/tag_consensus.py b/sourcecode/scoring/tag_consensus.py index 6c1e4b8a..e51b61d6 100644 --- a/sourcecode/scoring/tag_consensus.py +++ b/sourcecode/scoring/tag_consensus.py @@ -1,3 +1,4 @@ +import logging from typing import Optional from . 
import constants as c, process_data @@ -6,6 +7,10 @@ import pandas as pd +logger = logging.getLogger("birdwatch.tag_consensus") +logger.setLevel(logging.INFO) + + def train_tag_model( ratings: pd.DataFrame, tag: str = c.notHelpfulSpamHarassmentOrAbuseTagKey, @@ -14,16 +19,16 @@ def train_tag_model( useSigmoidCrossEntropy: bool = True, name: Optional[str] = None, ): - print(f"-------------------Training for tag {tag}-------------------") + logger.info(f"-------------------Training for tag {tag}-------------------") ratingDataForTag, labelColName = prepare_tag_data(ratings, tag) if ratingDataForTag is None or len(ratingDataForTag) == 0: - print(f"No valid data for {tag}, returning None and aborting {tag} model training.") + logger.info(f"No valid data for {tag}, returning None and aborting {tag} model training.") return None, None, None posRate = ratingDataForTag[labelColName].sum() / len(ratingDataForTag) - print(f"{tag} Positive Rate: {posRate}") + logger.info(f"{tag} Positive Rate: {posRate}") if pd.isna(posRate) or posRate == 0 or posRate == 1: - print( + logger.info( f"{tag} tag positive rate is {posRate}: returning None and aborting {tag} model training." ) return None, None, None @@ -105,16 +110,16 @@ def prepare_tag_data( # Positives ratings.loc[ratings[tagName] == 1, labelColName] = 1 - print("Pre-filtering tag label breakdown", ratings.groupby(labelColName).size()) - print("Number of rows with no tag label", ratings[labelColName].isnull().sum()) + logger.info(f"Pre-filtering tag label breakdown {ratings.groupby(labelColName).size()}") + logger.info(f"Number of rows with no tag label {ratings[labelColName].isnull().sum()}") # Currently leave in raters who only made one type of rating, but can throw them out in the future. ratings = process_data.filter_ratings( ratings[ratings[labelColName].notnull()], minNumRatingsPerRater, minNumRatersPerNote ) - print("Post-filtering tag label breakdown", ratings.groupby(labelColName).size()) - print("Number of rows with no tag label", ratings[labelColName].isnull().sum()) + logger.info(f"Post-filtering tag label breakdown {ratings.groupby(labelColName).size()}") + logger.info(f"Number of rows with no tag label {ratings[labelColName].isnull().sum()}") ratings[labelColName] = ratings[labelColName].astype(int) diff --git a/sourcecode/scoring/tag_filter.py b/sourcecode/scoring/tag_filter.py index 6eee4afe..c09704ac 100644 --- a/sourcecode/scoring/tag_filter.py +++ b/sourcecode/scoring/tag_filter.py @@ -1,5 +1,6 @@ """Utilites for tag based scoring logic.""" +import logging from typing import Dict from . import constants as c @@ -8,6 +9,10 @@ import pandas as pd +logger = logging.getLogger("birdwatch.tag_filter") +logger.setLevel(logging.INFO) + + def _normalize_factors(rawFactors: pd.DataFrame, entityKey: str, factorKey: str) -> pd.DataFrame: """Performs Z-Normalization on embedding factors. @@ -132,5 +137,11 @@ def get_tag_thresholds(ratings: pd.DataFrame, percentile: int) -> Dict[str, floa """ thresholds = {} for column in c.notHelpfulTagsAdjustedRatioColumns: - thresholds[column] = np.quantile(ratings[column], np.arange(0, 1, 0.01))[percentile] + if len(ratings[column]) == 0: + logger.info( + f"Warning: No ratings for column {column} in get_tag_thresholds. Setting threshold to 0.0 arbitrarily." 
+ ) + thresholds[column] = 0.0 + else: + thresholds[column] = np.quantile(ratings[column], np.arange(0, 1, 0.01))[percentile] return thresholds diff --git a/sourcecode/scoring/topic_model.py b/sourcecode/scoring/topic_model.py index 32ba94c9..c7edb9be 100644 --- a/sourcecode/scoring/topic_model.py +++ b/sourcecode/scoring/topic_model.py @@ -9,7 +9,9 @@ evaluates the efficacy of per-topic note scoring. """ -from typing import List, Tuple +import logging +import re +from typing import List, Optional, Tuple from . import constants as c from .enums import Topics @@ -22,6 +24,10 @@ from sklearn.pipeline import Pipeline +logger = logging.getLogger("birdwatch.topic_model") +logger.setLevel(logging.INFO) + + class TopicModel(object): def __init__(self): """Initialize a list of seed terms for each topic.""" @@ -42,38 +48,50 @@ def __init__(self): "jerusalem", }, Topics.MessiRonaldo: { - "messi ", # intentional whitespace to prevent prefix matches + "messi\s", # intentional whitespace to prevent prefix matches "ronaldo", }, } + self._compiled_regex = self._compile_regex() + + def _compile_regex(self): + """Compile a single regex from all seed terms grouped by topic.""" + regex_patterns = {} + for topic, patterns in self._seedTerms.items(): + group_name = f"{topic.name}" + regex_patterns[group_name] = f"(?P<{group_name}>{'|'.join(patterns)})" + + combined_regex = "|".join(regex_patterns.values()) + return re.compile(combined_regex, re.IGNORECASE) def _make_seed_labels(self, texts: np.ndarray) -> Tuple[np.ndarray, np.ndarray]: """Produce a label vector based on seed terms. - The label vector has type np.int64 with values corresponding to the enum value for - each topic. Any text which matches seed terms from multiple topics is left unassigned. - Args: texts: array containing strings for topic assignment Returns: - Tuple[0]: array specifing topic labels for texts + Tuple[0]: array specifying topic labels for texts Tuple[1]: array specifying texts that are unassigned due to conflicting matches. """ - texts = np.array([text.lower() for text in texts]) labels = np.zeros(texts.shape[0], dtype=np.int64) - conflictedTexts = np.zeros(texts.shape[0]) - for topic in Topics: - if topic == Topics.Unassigned: - continue - topicMatches = np.array( - [any(term in text for term in self._seedTerms[topic]) for text in texts] - ) - labels[topicMatches] = topic.value - conflictedTexts += topicMatches.astype(np.int64) - labels[conflictedTexts > 1] = Topics.Unassigned.value - print(f" Notes unassigned due to multiple matches: {conflictedTexts.sum()}") - return labels, conflictedTexts > 1 + conflictedTexts = np.zeros(texts.shape[0], dtype=bool) + + for i, text in enumerate(texts): + matches = self._compiled_regex.finditer(text.lower()) + found_topics = set() + for match in matches: + found_topics.update([Topics[grp].value for grp in match.groupdict() if match.group(grp)]) + + if len(found_topics) == 1: + labels[i] = found_topics.pop() + elif len(found_topics) > 1: + labels[i] = Topics.Unassigned.value + conflictedTexts[i] = True + + unassigned_count = np.sum(conflictedTexts) + logger.info(f" Notes unassigned due to multiple matches: {unassigned_count}") + return labels, conflictedTexts def _get_stop_words(self, texts: np.ndarray) -> List[str]: """Identify tokens in the extracted vocabulary that contain seed terms. 
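
The `_make_seed_labels` rewrite above swaps the per-topic substring scans for a single compiled regex with one named group per topic: a note gets a seed label only when exactly one topic's group matches, and texts that match multiple topics are flagged as conflicted and left unassigned. A toy, self-contained sketch of that pattern (topic names and seed terms here are illustrative, not the production lists):

```python
import re

# Toy version of the named-group regex built in _compile_regex. Note the
# trailing \s on "messi" to avoid prefix matches, mirroring the seed list above.
seed_terms = {
    "UkraineConflict": {"ukrain", "kyiv"},
    "MessiRonaldo": {r"messi\s", "ronaldo"},
}
combined = "|".join(
    f"(?P<{topic}>{'|'.join(terms)})" for topic, terms in seed_terms.items()
)
pattern = re.compile(combined, re.IGNORECASE)

def seed_label(text: str):
  """Return a single topic name, or None when zero or multiple topics match."""
  topics = {name for m in pattern.finditer(text.lower())
            for name, grp in m.groupdict().items() if grp}
  return topics.pop() if len(topics) == 1 else None

print(seed_label("Messi scored again"))        # MessiRonaldo
print(seed_label("Kyiv update"))               # UkraineConflict
print(seed_label("Messi visited Kyiv today"))  # None: conflicting matches
```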
@@ -92,14 +110,15 @@ def _get_stop_words(self, texts: np.ndarray) -> List[str]: cv = CountVectorizer(strip_accents="unicode") cv.fit(texts) rawVocabulary = cv.vocabulary_.keys() - print(f" Initial vocabulary length: {len(rawVocabulary)}") + logger.info(f" Initial vocabulary length: {len(rawVocabulary)}") # Identify stop words blockedTokens = set() for terms in self._seedTerms.values(): - blockedTokens |= {t.strip() for t in terms} - print(f" Total tokens to filter: {len(blockedTokens)}") + # Remove whitespace and any escaped characters from terms + blockedTokens |= {re.sub(r"\\.", "", t.strip()) for t in terms} + logger.info(f" Total tokens to filter: {len(blockedTokens)}") stopWords = [v for v in rawVocabulary if any(t in v for t in blockedTokens)] - print(f" Total identified stopwords: {len(stopWords)}") + logger.info(f" Total identified stopwords: {len(stopWords)}") return stopWords def _merge_predictions_and_labels( @@ -141,7 +160,52 @@ def _prepare_post_text(self, notes: pd.DataFrame) -> pd.DataFrame: ] return postNoteText - def get_note_topics(self, notes: pd.DataFrame) -> pd.DataFrame: + def train_note_topic_classifier( + self, notes: pd.DataFrame + ) -> Tuple[Pipeline, np.ndarray, np.ndarray]: + # Obtain aggregate post text, seed labels and stop words + with c.time_block("Get Note Topics: Prepare Post Text"): + postText = self._prepare_post_text(notes) + + with c.time_block("Get Note Topics: Make Seed Labels"): + seedLabels, conflictedTexts = self._make_seed_labels(postText[c.summaryKey].values) + + with c.time_block("Get Note Topics: Get Stop Words"): + stopWords = self._get_stop_words(postText[c.summaryKey].values) + + with c.time_block("Get Note Topics: Train Model"): + # Define and fit model + pipe = Pipeline( + [ + ( + "UnigramEncoder", + CountVectorizer( + strip_accents="unicode", + stop_words=stopWords, + min_df=25, + max_df=max(1000, int(0.25 * len(postText))), + ), + ), + ("tfidf", TfidfTransformer()), + ("Classifier", LogisticRegression(max_iter=1000, verbose=1)), + ], + verbose=True, + ) + pipe.fit( + # Notice that we omit posts with an unclear label from training. + postText[c.summaryKey].values[~conflictedTexts], + seedLabels[~conflictedTexts], + ) + + return pipe, seedLabels, conflictedTexts + + def get_note_topics( + self, + notes: pd.DataFrame, + noteTopicClassifier: Optional[Pipeline] = None, + seedLabels: Optional[np.ndarray] = None, + conflictedTextsForAccuracyEval: Optional[np.ndarray] = None, + ) -> pd.DataFrame: """Return a DataFrame specifying each {note, topic} pair. Notes that are not assigned to a topic do not appear in the dataframe. 
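
`train_note_topic_classifier` and the slimmed-down `get_note_topics` now separate training from prediction so a fitted pipeline can be reused across calls. The overall flow is: seed-label the aggregated post text, fit a bag-of-words / TF-IDF / logistic-regression pipeline on the unambiguously labeled posts, then predict over every post. A compressed, self-contained sketch with toy texts and labels (the production pipeline additionally sets stop words, `min_df` and `max_df`):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus with hand-assigned seed labels; 0 marks posts with no
# (or conflicting) seed match, which are omitted from training.
texts = np.array([
    "messi scored twice last night",
    "ronaldo transfer rumours again",
    "kyiv reports new strikes",
    "ukraine aid package announced",
    "a note about something else entirely",
])
seedLabels = np.array([2, 2, 1, 1, 0])  # illustrative topic ids
trainable = seedLabels != 0

pipe = Pipeline([
    ("UnigramEncoder", CountVectorizer(strip_accents="unicode")),
    ("tfidf", TfidfTransformer()),
    ("Classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(texts[trainable], seedLabels[trainable])

# Predict over *all* texts: the model both revisits its own training posts and
# assigns labels to posts that had no usable seed match.
print(pipe.predict(texts))
```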
@@ -149,49 +213,46 @@ def get_note_topics(self, notes: pd.DataFrame) -> pd.DataFrame: Args: notes: DF containing all notes to potentially assign to a topic """ - print("Assigning notes to topics:") - # Obtain aggregate post text, seed labels and stop words + logger.info("Assigning notes to topics:") + if noteTopicClassifier is not None: + pipe = noteTopicClassifier + else: + logger.info("Training note topic classifier") + pipe, seedLabels, conflictedTextsForAccuracyEval = self.train_note_topic_classifier(notes) postText = self._prepare_post_text(notes) - seedLabels, conflictedTexts = self._make_seed_labels(postText[c.summaryKey].values) - stopWords = self._get_stop_words(postText[c.summaryKey].values) - # Define and fit model - pipe = Pipeline( - [ - ( - "UnigramEncoder", - CountVectorizer( - strip_accents="unicode", - stop_words=stopWords, - min_df=25, - max_df=max(1000, int(0.25 * len(postText))), - ), - ), - ("tfidf", TfidfTransformer()), - ("Classifier", LogisticRegression(max_iter=1000, verbose=1)), - ], - verbose=True, - ) - pipe.fit( - # Notice that we omit posts with an unclear label from training. - postText[c.summaryKey].values[~conflictedTexts], - seedLabels[~conflictedTexts], - ) - # Predict notes. Notice that in effect we are looking to see which notes in the - # training data the model felt were mis-labeled after the training process - # completed, and generating labels for any posts which were omitted from the - # original training. - pred = pipe.predict(postText[c.summaryKey].values) + + with c.time_block("Get Note Topics: Predict"): + # Predict notes. Notice that in effect we are looking to see which notes in the + # training data the model felt were mis-labeled after the training process + # completed, and generating labels for any posts which were omitted from the + # original training. + pred = pipe.predict(postText[c.summaryKey].values) + + if seedLabels is None: + with c.time_block("Get Note Topics: Make Seed Labels"): + seedLabels, _ = self._make_seed_labels(postText[c.summaryKey].values) + + if conflictedTextsForAccuracyEval is not None: + self.validate_note_topic_accuracy_on_seed_labels( + pred, seedLabels, conflictedTextsForAccuracyEval + ) + + with c.time_block("Get Note Topics: Merge and assign predictions"): + pred = self._merge_predictions_and_labels(pred, seedLabels) + logger.info(f" Topic assignment results: {np.bincount(pred)}") + + # Assign topics to notes based on aggregated note text, and drop any + # notes on posts that were unassigned. 
+ postText[c.noteTopicKey] = [Topics(t).name for t in pred] + postText = postText[postText[c.noteTopicKey] != Topics.Unassigned.name] + noteTopics = notes[[c.noteIdKey, c.tweetIdKey]].merge( + postText[[c.tweetIdKey, c.noteTopicKey]] + ) + return noteTopics.drop(columns=c.tweetIdKey) + + def validate_note_topic_accuracy_on_seed_labels(self, pred, seedLabels, conflictedTexts): balancedAccuracy = balanced_accuracy_score(seedLabels[~conflictedTexts], pred[~conflictedTexts]) - print(f" Balanced accuracy on raw predictions: {balancedAccuracy}") + logger.info(f" Balanced accuracy on raw predictions: {balancedAccuracy}") assert balancedAccuracy > 0.5, f"Balanced accuracy too low: {balancedAccuracy}" - # Validate that any conflicted text is Unassigned + # Validate that any conflicted text is Unassigned in seedLabels assert all(seedLabels[conflictedTexts] == Topics.Unassigned.value) - pred = self._merge_predictions_and_labels(pred, seedLabels) - print(f" Topic assignment results: {np.bincount(pred)}") - - # Assign topics to notes based on aggregated note text, and drop any - # notes on posts that were unassigned. - postText[c.noteTopicKey] = [Topics(t).name for t in pred] - postText = postText[postText[c.noteTopicKey] != Topics.Unassigned.name] - noteTopics = notes[[c.noteIdKey, c.tweetIdKey]].merge(postText[[c.tweetIdKey, c.noteTopicKey]]) - return noteTopics.drop(columns=c.tweetIdKey)
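
Stepping back to the `apply_scoring_rules` changes earlier in this diff: the loop keeps its original shape and mainly gains timing blocks, logging, and `unsafeAllowed` arguments on `concat`/`merge` (those calls go through the repo's patched pandas helpers; stock pandas does not accept that keyword, so the sketch below uses plain pandas). For readers new to the file, a skeletal version of the pattern with made-up rule outputs:

```python
import pandas as pd

# Each rule proposes (noteId, status) updates; the last update per note wins,
# and every rule that touched a note is recorded for the active-rules column.
rules = [
    ("GeneralCRH", pd.DataFrame({"noteId": [1, 2], "status": ["CRH", "CRH"]})),
    ("RejectLowIntercept", pd.DataFrame({"noteId": [2], "status": ["FIRM_REJECT"]})),
]

noteLabels = pd.DataFrame({"noteId": pd.Series([], dtype="int64"),
                           "status": pd.Series([], dtype="object")})
noteRules = pd.DataFrame({"noteId": pd.Series([], dtype="int64"),
                          "rule": pd.Series([], dtype="object")})

for name, updates in rules:
  # Keep at most one label per note: later rules overwrite earlier ones.
  noteLabels = pd.concat([noteLabels, updates]).groupby("noteId").tail(1)
  noteRules = pd.concat(
      [noteRules, pd.DataFrame({"noteId": updates["noteId"], "rule": name})]
  )

# Condense to one row per note, listing all active rules and the deciding (last) rule.
noteRules = noteRules.groupby("noteId").aggregate(list).reset_index()
noteRules["decidedBy"] = [r[-1] for r in noteRules["rule"]]
noteRules["rule"] = [",".join(r) for r in noteRules["rule"]]
print(noteLabels.merge(noteRules, on="noteId"))
```

Under these toy inputs, note 1 ends up CRH decided by GeneralCRH, while note 2 is overwritten to FIRM_REJECT with both rules listed as active, which is the same bookkeeping the production loop performs before merging labels, rules, and additional columns back into `noteStats`.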