Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queries and Migrations #4

Open
s2t2 opened this issue Sep 18, 2023 · 14 comments
Open

Queries and Migrations #4

s2t2 opened this issue Sep 18, 2023 · 14 comments

Comments

@s2t2
Copy link
Owner

s2t2 commented Sep 18, 2023

Queries for de-duping the data and setting up for analysis. Later we can formalize these migrations into the codebase.

@s2t2
Copy link
Owner Author

s2t2 commented Sep 18, 2023

De-duplicating

SELECT status_id, count(*) as row_count 
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` 
GROUP BY 1
ORDER BY 2 DESC

-- DE-DUP THESE

@s2t2
Copy link
Owner Author

s2t2 commented Sep 18, 2023

Analyzing dups and their patterns:

SELECT status_id, count(*) as row_count
FROM (

    SELECT DISTINCT 
      status_id, status_text, created_at, lang, 
      user_id, retweet_status_id, reply_status_id, quote_status_id, conversation_id
      --, matching_rule_ids
    FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`

) 
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
-- example statuses with dups after removing distinct dups
-- 1635601440512290816
-- 1644393606159925260	
-- 1637543665307635712
-- 1636111994800549889
-- 1645932927761494018	
-- 1640107946322325504
-- 1637117907938451457
-- 1637644471696343040
-- 1642432953438593025
    
SELECT DISTINCT 
      status_id, status_text, created_at, lang, 
      user_id, retweet_status_id, reply_status_id, quote_status_id, conversation_id
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`
WHERE status_id = 1637530485386625024 -- 1636111994800549889 -- 1635601440512290816 -- one row has quote_status_id and the other is null
-- we also saw some where one has language of "es" and the other has lang of "en" (we want to aggregate lang into an array of all the langs)

@s2t2
Copy link
Owner Author

s2t2 commented Sep 18, 2023

Migration for tweets_v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v2` as (
  -- now we will also de-dupe based on presence or absence of certain identifiers, taking any non-null value
  --- ... and aggregate the langs because we saw there can be more than one for the same tweet
  SELECT status_id, status_text, created_at, user_id
    ,any_value(retweet_status_id) as retweet_status_id
    ,any_value(reply_status_id) as reply_status_id
    ,any_value(quote_status_id) as quote_status_id
    ,any_value(conversation_id) as conversation_id
    ,string_agg(DISTINCT lang, ", " ) as langs
  FROM (
    -- first step is to de-dupe the rows that are exactly the same
    SELECT DISTINCT status_id, status_text, created_at, lang, 
          user_id, retweet_status_id, reply_status_id, quote_status_id, conversation_id
          -- todo: add streaming_rules_ids as well (but they don't play well with the distinct operator)
    FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`
    -- WHERE status_id = 1644393606159925260 -- one quote one null quote
    --WHERE status_id = 1590401850268090369 -- many langs
  )
  GROUP BY 1,2,3,4
);

Verifying we have the same number of rows as before:

-- SELECT count(distinct status_id) as status_count 
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` -- 9676610
SELECT count(distinct status_id) as status_count 
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`; -- 9676610

Status Rules

Streaming rules table (row per tweet per rule). This way we can identify any statuses collected using off-topic rules (in case developers or previous assistants may have set up another stream listener thinking the rules would be kept separate but actually some tweets from another collection effort wound up in here 😅

-- SELECT status_id
--   ,array_length(matching_rule_ids) as rules_count
--   ,matching_rule_ids
-- FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` 
-- ORDER BY 2 DESC -- status 1646592462406799373 has 12 matching rules
-- LIMIT 10

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweet_rules`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweet_rules` as (
  SELECT DISTINCT status_id,rule_id
  FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`,
    UNNEST(matching_rule_ids) as rule_id
  -- WHERE status_id = 1646592462406799373 -- 12 rows, awesome
);

Analysis:

SELECT rule_id, count(distinct status_id) as status_count
FROM `tweet-research-shared.f1_racing_2023.streaming_tweet_rules`
GROUP BY 1
ORDER BY 2 DESC
rule_id status_count
1634579704828117005 5196469
1634579704828117001 2458015
1634579704828117008 892310
1634579704828116999 742394
1634579704828117004 632647
1634579704828116995 537927
1634579704828116993 458885
1634579704828117002 402241
1634579704828117003 311906
1634579704828116992 246958
1634579704828116998 151988
1634579704828117006 124075
1634579704828116994 118209
1634579704828117007 77981
1634579704828117000 1004

@JiazanShi
Copy link

drop duplicates in streaming_users and keep the most recent users profile data in streaming_users_v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` ;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` AS (
  SELECT *
  FROM (
    SELECT * ,
        ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY accessed_at DESC) AS rnum
    FROM `tweet-research-shared.f1_racing_2023.streaming_users` 
  )
  WHERE rnum = 1
)

verify the data is correct:

SELECT COUNT(DISTINCT user_id)
FROM `tweet-research-shared.f1_racing_2023.streaming_users`  -- 1384754

SELECT COUNT(DISTINCT user_id)
FROM `tweet-research-shared.f1_racing_2023.streaming_users_v2` -- 1384754

@s2t2
Copy link
Owner Author

s2t2 commented Sep 23, 2023

Status mentions v2 (row per each user mentioned in a given status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2` as (
  SELECT DISTINCT status_id, user_id, UPPER(user_screen_name) as user_screen_name
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_mentions` sm
  --LIMIT 10
)

Status hashtags v2 (row per hastag per status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2` as (
  SELECT DISTINCT status_id, UPPER(tag) as tag
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_hashtags` sh
  --LIMIT 10
)

Status urls v2 (row per status per url, with domains parsed 😸 ):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2` as (

  SELECT status_id, url, 
      regexp_replace(
            split(split(url, "://")[OFFSET(1)], "/")[OFFSET(0)]
            , "www.", ""
      ) as url_domain
  FROM (
      SELECT DISTINCT status_id, url
      FROM `tweet-research-shared.f1_racing_2023.streaming_status_urls` su
      --LIMIT 10
  )
)

Status entities v2 (row per status per named entity) -- named entity recognition (todo: import the lookup tables):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2` as (
  -- HERE ARE THE CONTEXT DOMAIN AND ENTITY LOOKUP TABLES
  -- https://developer.twitter.com/en/docs/twitter-api/annotations/overview
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-domains-20220601.csv
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-entities-20220601.csv
  -- 
  SELECT DISTINCT status_id, domain_id, entity_id
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_entities`
  -- LIMIT 10
)

Status annotations v2 (row per status per annotation type and text):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2` as (
  -- annotations with given type (Other, Organization), and annotation text
  -- count and aggregate the non-dup probability scores
  SELECT status_id, type, text
      ,min(probability) as min_probability
      ,max(probability) as max_probability -- UPDATED TO MAX :-)
      ,avg(probability) as avg_probability
      ,count(*) as row_count
  FROM (
    -- remove dups first
    SELECT DISTINCT status_id, type, upper(text) as text, probability
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_annotations`
  )
  GROUP BY 1,2,3
  ORDER BY 7 DESC
)

@s2t2
Copy link
Owner Author

s2t2 commented Sep 23, 2023

Streaming media (row per unique "media key"):

-- SELECT media_key, count(*) as row_count
-- FROM `tweet-research-shared.f1_racing_2023.streaming_media`
-- GROUP BY 1 
-- ORDER BY 2 DESC
-- LIMIT 10
-- hmm there are a few dups like 13_1662438835244900352, 13_1655282075690323974, 13_1637199039660961793

--SELECT *
--FROM `tweet-research-shared.f1_racing_2023.streaming_media`
--WHERE media_key = '13_1662438835244900352' -- hmm this same media key has different preview image urls and widths?
--
--SELECT *
--FROM `tweet-research-shared.f1_racing_2023.streaming_media`
--WHERE media_key = '13_1655282075690323974' -- hmm this same media key has different URLs and widths?
--
--SELECT *
--FROM `tweet-research-shared.f1_racing_2023.streaming_media`
--WHERE media_key = '13_1637199039660961793' 
-- -- hmm this same media key has different preview image URLs : 
-- -- https://pbs.twimg.com/media/FriSDJAaMAARIdG.jpg, 
-- -- https://pbs.twimg.com/media/FriQWelaAAAwlWR.jpg
-- -- in some cases the preview images seem complimentary, so we can try to aggregate them

-- oh there are still some dups like: 13_1666420362333114370
-- SELECT *
-- FROM `tweet-research-shared.f1_racing_2023.streaming_media_v2`
-- WHERE media_key = '13_1666420362333114370' -- there are different widths let's not care about the heights and widths




DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_media_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_media_v2` as (
    SELECT media_key, type, url
        ,duration_ms 
        --,height, width ,alt_text
        , string_agg(preview_image_url, "|") as preview_image_urls
    FROM (
      -- remove dups
      SELECT DISTINCT media_key, type, url, preview_image_url, alt_text, duration_ms, height, width,
      FROM `tweet-research-shared.f1_racing_2023.streaming_media`
    )
    GROUP BY 1,2,3,4 --,5,6, 7

)


-- verify row per unique media key (no media keys should have more than one row)
SELECT media_key, count(*) as row_count
FROM `tweet-research-shared.f1_racing_2023.streaming_media_v2`
GROUP BY 1 
ORDER BY 2 DESC
LIMIT 10

@s2t2
Copy link
Owner Author

s2t2 commented Sep 23, 2023

NOTE: I believe these tables represent tags and mentions in user profiles. So we store earliest and latest date seen in the profile. Counting number of times accessed is not that important.

Streaming user hashtags (row per user per tag):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_hashtags_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_hashtags_v2` as (
  SELECT user_id, UPPER(tag) as tag
    ,min(date(accessed_at)) as earliest_access
    ,max(date(accessed_at)) as latest_access
  FROM `tweet-research-shared.f1_racing_2023.streaming_user_hashtags`
  GROUP BY 1, 2
  -- LIMIT 10
)

User hashtags (row per user per tag):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_mentions_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_mentions_v2` as (
    SELECT user_id, UPPER(mention_screen_name) as mention_screen_name
      ,min(date(accessed_at)) as earliest_access
      ,max(date(accessed_at)) as latest_access
    FROM `tweet-research-shared.f1_racing_2023.streaming_user_mentions`
    GROUP BY 1,2
    -- LIMIT 10
)

@s2t2
Copy link
Owner Author

s2t2 commented Sep 23, 2023

News Quality

Lookup table:

CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.domain_fact_scores` as (
  -- GRABBING THE FACT SCORES LOOKUP TABLE PREVIOUSLY UPLOADED TO ANOTHER DATASET
  SELECT *
  FROM `tweet-research-shared.impeachment_2020.domain_fact_scores`
)

Assigning a fact score using the lookup table (FYI not all statuses share urls, and thus many will not have fact scores):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` as (
   SELECT
      urls.status_id
      ,urls.url
      ,urls.url_domain
      ,facts.domain as fact_domain
      ,facts.fact_score
      ,facts.fact_category
   FROM `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2` urls
   JOIN `tweet-research-shared.f1_racing_2023.domain_fact_scores` facts ON facts.domain = urls.url_domain
   --ORDER BY fact_score DESC
   --LIMIT 10
)

Aggregated user fact scores:

-- calling this user_fact_scores instead of streaming_status_user_fact_scores 
-- (the streaming_user namespace is currently about user profiles)
DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.user_fact_scores`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.user_fact_scores` as (
    SELECT
        t.user_id
        ,count(distinct f.status_id) as status_count
        ,avg(f.fact_score) as avg_fact_score
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` f
    JOIN `tweet-research-shared.f1_racing_2023.streaming_tweets_v2` t on t.status_id = f.status_id
    GROUP BY 1
    --ORDER BY 2 DESC
    --LIMIT 50
);

Analysis (assess the percentage of tweets for which we have fact scores):

SELECT count(distinct t.status_id) as status_count
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2` t -- 9_676_610
--JOIN `tweet-research-shared.f1_racing_2023.streaming_status_urls` urls ON urls.status_id = t.status_id -- 4_948_138
--JOIN `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` f on f.status_id = t.status_id -- 977

SELECT count(distinct status_id) as status_count
FROM `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` -- 977 SO LOW!

@s2t2
Copy link
Owner Author

s2t2 commented Sep 24, 2023

Revisiting users_v2 migration:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` ;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` AS (
  SELECT 
    u.user_id
    ,date(u.created_at) as created_on
    
      -- CURRENT / LATEST:

      ,latest_row.screen_name
      ,latest_row.name 
      ,latest_row.description
      ,latest_row.url
      ,latest_row.profile_image_url
      ,latest_row.verified
      ,latest_row.pinned_tweet_id
      ,latest_row.followers_count
      ,latest_row.following_count
      ,latest_row.tweet_count
      ,latest_row.listed_count

      -- AGGREGATES:

      ,min(date(u.accessed_at)) as first_access
      ,max(date(u.accessed_at)) as latest_access

      ,count(distinct upper(u.screen_name)) as screen_name_count
      ,array_agg(distinct upper(u.screen_name) IGNORE NULLS) as screen_names

      ,count(distinct upper(u.name)) as names_count
      ,array_agg(distinct UPPER(u.name) IGNORE NULLS) as names 

      ,count(distinct u.description) as descriptions_count
      ,string_agg(distinct u.description, " | ") as descriptions
    
      ,count(distinct u.url) as urls_count
      --,array_agg(DISTINCT url IGNORE NULLS) as urls

      ,count(distinct u.profile_image_url) as profile_image_urls_count
      --,array_agg(profile_image_url IGNORE NULLS) as profile_image_urls

  FROM `tweet-research-shared.f1_racing_2023.streaming_users` u
  JOIN (
    -- FOR EACH USER, GET THE LATEST VALUES
    SELECT latest.user_id, latest.last_accessed_at
      ,upper(u.screen_name) as screen_name
      ,u.name as name
      ,u.description
      ,u.url
      ,u.profile_image_url
      ,u.verified
      ,u.pinned_tweet_id
      ,u.followers_count
      ,u.following_count
      ,u.tweet_count
      ,u.listed_count
    FROM`tweet-research-shared.f1_racing_2023.streaming_users` u 
    JOIN (
      -- FOR EACH USER, WHAT IS THE TIMESTAMP OF THE LATEST ROW?
      SELECT user_id, max(accessed_at) as last_accessed_at
      FROM `tweet-research-shared.f1_racing_2023.streaming_users` -- 1384754
      GROUP BY 1
    ) latest on latest.last_accessed_at = u.accessed_at and latest.user_id = u.user_id  -- assumes no dups on these composite key fields
    --LIMIT 10
  ) latest_row on latest_row.user_id = u.user_id
  GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13
)

what a beast

@s2t2
Copy link
Owner Author

s2t2 commented Sep 26, 2023

Twitter Named Entity Recognition

Twitter Context Entities and their corresponding Domains:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.context_entities`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.context_entities` as (
  SELECT * 
  FROM `tweet-collector-py.jan6_committee_development.entities` 
  -- where I had previously uploaded this CSV file from twitterdev:
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-entities-20220601.csv
);
DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.context_domains`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.context_domains` as (
  SELECT * 
  FROM `tweet-collector-py.jan6_committee_development.domains` 
  -- where I had previously uploaded this CSV file from twitterdev:
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-domains-20220601.csv
);

Joining and cleaning:

--SELECT e.entity_id, e.entity_name, e.domain_ids
-- FROM `tweet-research-shared.f1_racing_2023.context_entities` e -- 144,753
-- some entities have multiple domains. for example:
-- 1139229372198469633 Dogecoin cryptocurrency | domains: 131, 174

-- SELECT count(distinct entity_id) as entity_count 
-- FROM `tweet-research-shared.f1_racing_2023.context_entities` -- 144,731

-- SELECT DISTINCT entity_id,entity_name, domain_id
-- FROM `tweet-research-shared.f1_racing_2023.context_entities`,
-- UNNEST(domain_ids) as domain_id
-- WHERE entity_id = 1139229372198469633 -- 2 rows, great

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.context_entities_v2`;
CREATE TABLE `tweet-research-shared.f1_racing_2023.context_entities_v2` as (
    SELECT e.entity_id, e.entity_name, e.domain_id, d.domain_name
    FROM (
        SELECT DISTINCT entity_id,entity_name, domain_id
        FROM `tweet-research-shared.f1_racing_2023.context_entities`,
        UNNEST(domain_ids) as domain_id
        --WHERE entity_id = 1139229372198469633 -- 2 rows, great
    ) e
    LEFT JOIN `tweet-research-shared.f1_racing_2023.context_domains` d ON d.domain_id = e.domain_id
);

Analysis:

SELECT domain_id, domain_name, count(distinct entity_id) as entity_count
FROM `tweet-research-shared.f1_racing_2023.context_entities_v2`
-- WHERE domain_name is null -- 0 awesome
GROUP BY 1,2
ORDER BY 3 DESC
-- 31 Community: 71215
-- 131 Unified Twitter Taxonomy: 47499
-- 10 Person:  43188
-- 60 Athlete: 17960

@JiazanShi
Copy link

Status mentions v2 (row per each user mentioned in a given status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2` as (
  SELECT DISTINCT status_id, user_id, UPPER(user_screen_name) as user_screen_name
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_mentions` sm
  --LIMIT 10
)

Status hashtags v2 (row per hastag per status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2` as (
  SELECT DISTINCT status_id, UPPER(tag) as tag
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_hashtags` sh
  --LIMIT 10
)

Status urls v2 (row per status per url, with domains parsed 😸 ):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2` as (

  SELECT status_id, url, 
      regexp_replace(
            split(split(url, "://")[OFFSET(1)], "/")[OFFSET(0)]
            , "www.", ""
      ) as url_domain
  FROM (
      SELECT DISTINCT status_id, url
      FROM `tweet-research-shared.f1_racing_2023.streaming_status_urls` su
      --LIMIT 10
  )
)

Status entities v2 (row per status per named entity) -- named entity recognition (todo: import the lookup tables):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2` as (
  -- HERE ARE THE CONTEXT DOMAIN AND ENTITY LOOKUP TABLES
  -- https://developer.twitter.com/en/docs/twitter-api/annotations/overview
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-domains-20220601.csv
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-entities-20220601.csv
  -- 
  SELECT DISTINCT status_id, domain_id, entity_id
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_entities`
  -- LIMIT 10
)

Status annotations v2 (row per status per annotation type and text):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2` as (
  -- annotations with given type (Other, Organization), and annotation text
  -- count and aggregate the non-dup probability scores
  SELECT status_id, type, text
      ,min(probability) as min_probability
      ,min(probability) as max_probability
      ,avg(probability) as avg_probability
      ,count(*) as row_count
  FROM (
    -- remove dups first
    SELECT DISTINCT status_id, type, upper(text) as text, probability
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_annotations`
  )
  GROUP BY 1,2,3
  ORDER BY 7 DESC
)

update migration query for status annotation_v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2` as (
  -- annotations with given type (Other, Organization), and annotation text
  -- count and aggregate the non-dup probability scores
  SELECT status_id, type, text
      ,MIN(probability) as min_probability
      ,MAX(probability) as max_probability
      ,AVG(probability) as avg_probability
      ,COUNT(*) as row_count
  FROM (
    -- remove dups first
    SELECT DISTINCT status_id, type, upper(text) as text, probability
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_annotations`
  )
  GROUP BY 1,2,3
  ORDER BY 7 DESC
)

@s2t2 s2t2 changed the title Queries Queries and Migrations Sep 29, 2023
@s2t2
Copy link
Owner Author

s2t2 commented Sep 29, 2023

Streaming Rules

Matching rule names with their identifiers, and bumping the rules table:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_rules_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_rules_v2` as (
    SELECT 
        CASE -- we manually matched the names to ids after inspecting tweets matching the given rule id
            WHEN rule = '#F1' THEN 1634579704828117005
            WHEN rule = '@F1' THEN 1634579704828117001
            WHEN rule = '@MercedesAMGF1' THEN 1634579704828117008
            WHEN rule = '@redbullracing' THEN 1634579704828116999
            WHEN rule = '@ScuderiaFerrari' THEN 1634579704828117004
            WHEN rule = '@AstonMartinF1' THEN 1634579704828116995
            WHEN rule = '#Formula1' THEN 1634579704828116993
            WHEN rule = '@ESPNF1' THEN 1634579704828117002
            WHEN rule = '@McLarenF1' THEN 1634579704828117003
            WHEN rule = '@SkySportsF1' THEN 1634579704828116992
            WHEN rule = '@AlpineF1Team' THEN 1634579704828116998
            WHEN rule = '@WilliamsRacing' THEN 1634579704828117006
            WHEN rule = '@AlphaTauriF1' THEN 1634579704828116994
            WHEN rule = '@HaasF1Team' THEN 1634579704828117007
            WHEN rule = '@AlfaRomeoF1' THEN 1634579704828117000
        END rule_id
        ,rule as rule_name
        , created_at
    FROM `tweet-research-shared.f1_racing_2023.streaming_rules`
);

@JiazanShi
Copy link

de-dup query for tweets_v2:

SELECT COUNT(*)
FROM(
SELECT DISTINCT status_id,
    ANY_VALUE(status_text) AS status_text,
    ANY_VALUE(created_at) AS created_at,
    ANY_VALUE(user_id) AS user_id,
    ANY_VALUE(retweet_status_id) AS retweet_status_id,
    ANY_VALUE(reply_status_id) AS reply_status_id,
    ANY_VALUE(quote_status_id) AS quote_status_id,
    ANY_VALUE(conversation_id) AS conversation_id,
    ANY_VALUE(langs) AS langs
  FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`
  GROUP BY 1
)
-- 9676610

SELECT COUNT(DISTINCT status_id)
 FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` 
 -- 9676610 
-- we have 9676610 unique id from the original dataset and now the number of unique id in tweets_2 should be identical

@JiazanShi
Copy link

tweets dataset update to 'tweets_v3' to further de dups from v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v3`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v3` AS (
  SELECT DISTINCT status_id,
    ANY_VALUE(status_text) AS status_text,
    ANY_VALUE(created_at) AS created_at,
    ANY_VALUE(user_id) AS user_id,
    ANY_VALUE(retweet_status_id) AS retweet_status_id,
    ANY_VALUE(reply_status_id) AS reply_status_id,
    ANY_VALUE(quote_status_id) AS quote_status_id,
    ANY_VALUE(conversation_id) AS conversation_id,
    ANY_VALUE(langs) AS langs
  FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`
  GROUP BY 1
);

-- check
SELECT COUNT(*), COUNT(DISTINCT status_id)
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v3`
-- 9676610, 9676610

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants