Queries and Migrations #4

s2t2 · 2023-09-18T15:04:00Z

Queries for de-duping the data and setting up for analysis. Later we can formalize these migrations into the codebase.

s2t2 · 2023-09-18T15:04:11Z

De-duplicating

SELECT status_id, count(*) as row_count 
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` 
GROUP BY 1
ORDER BY 2 DESC

-- DE-DUP THESE

s2t2 · 2023-09-18T15:43:33Z

Analyzing dups and their patterns:

SELECT status_id, count(*) as row_count
FROM (

    SELECT DISTINCT 
      status_id, status_text, created_at, lang, 
      user_id, retweet_status_id, reply_status_id, quote_status_id, conversation_id
      --, matching_rule_ids
    FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`

) 
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
-- example statuses with dups after removing distinct dups
-- 1635601440512290816
-- 1644393606159925260	
-- 1637543665307635712
-- 1636111994800549889
-- 1645932927761494018	
-- 1640107946322325504
-- 1637117907938451457
-- 1637644471696343040
-- 1642432953438593025
    
SELECT DISTINCT 
      status_id, status_text, created_at, lang, 
      user_id, retweet_status_id, reply_status_id, quote_status_id, conversation_id
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`
WHERE status_id = 1637530485386625024 -- 1636111994800549889 -- 1635601440512290816 -- one row has quote_status_id and the other is null
-- we also saw some where one has language of "es" and the other has lang of "en" (we want to aggregate lang into an array of all the langs)

s2t2 · 2023-09-18T16:33:37Z

Migration for tweets_v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v2` as (
  -- now we will also de-dupe based on presence or absence of certain identifiers, taking any non-null value
  --- ... and aggregate the langs because we saw there can be more than one for the same tweet
  SELECT status_id, status_text, created_at, user_id
    ,any_value(retweet_status_id) as retweet_status_id
    ,any_value(reply_status_id) as reply_status_id
    ,any_value(quote_status_id) as quote_status_id
    ,any_value(conversation_id) as conversation_id
    ,string_agg(DISTINCT lang, ", " ) as langs
  FROM (
    -- first step is to de-dupe the rows that are exactly the same
    SELECT DISTINCT status_id, status_text, created_at, lang, 
          user_id, retweet_status_id, reply_status_id, quote_status_id, conversation_id
          -- todo: add streaming_rules_ids as well (but they don't play well with the distinct operator)
    FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`
    -- WHERE status_id = 1644393606159925260 -- one quote one null quote
    --WHERE status_id = 1590401850268090369 -- many langs
  )
  GROUP BY 1,2,3,4
);

Verifying we have the same number of rows as before:

-- SELECT count(distinct status_id) as status_count 
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` -- 9676610

SELECT count(distinct status_id) as status_count 
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`; -- 9676610

Status Rules

Streaming rules table (row per tweet per rule). This way we can identify any statuses collected using off-topic rules (in case developers or previous assistants may have set up another stream listener thinking the rules would be kept separate but actually some tweets from another collection effort wound up in here 😅

-- SELECT status_id
--   ,array_length(matching_rule_ids) as rules_count
--   ,matching_rule_ids
-- FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` 
-- ORDER BY 2 DESC -- status 1646592462406799373 has 12 matching rules
-- LIMIT 10

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweet_rules`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweet_rules` as (
  SELECT DISTINCT status_id,rule_id
  FROM `tweet-research-shared.f1_racing_2023.streaming_tweets`,
    UNNEST(matching_rule_ids) as rule_id
  -- WHERE status_id = 1646592462406799373 -- 12 rows, awesome
);

Analysis:

SELECT rule_id, count(distinct status_id) as status_count
FROM `tweet-research-shared.f1_racing_2023.streaming_tweet_rules`
GROUP BY 1
ORDER BY 2 DESC

rule_id	status_count
1634579704828117005	5196469
1634579704828117001	2458015
1634579704828117008	892310
1634579704828116999	742394
1634579704828117004	632647
1634579704828116995	537927
1634579704828116993	458885
1634579704828117002	402241
1634579704828117003	311906
1634579704828116992	246958
1634579704828116998	151988
1634579704828117006	124075
1634579704828116994	118209
1634579704828117007	77981
1634579704828117000	1004

JiazanShi · 2023-09-19T14:15:20Z

drop duplicates in streaming_users and keep the most recent users profile data in streaming_users_v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` ;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` AS (
  SELECT *
  FROM (
    SELECT * ,
        ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY accessed_at DESC) AS rnum
    FROM `tweet-research-shared.f1_racing_2023.streaming_users` 
  )
  WHERE rnum = 1
)

verify the data is correct:

SELECT COUNT(DISTINCT user_id)
FROM `tweet-research-shared.f1_racing_2023.streaming_users`  -- 1384754

SELECT COUNT(DISTINCT user_id)
FROM `tweet-research-shared.f1_racing_2023.streaming_users_v2` -- 1384754

s2t2 · 2023-09-23T18:33:04Z

Status mentions v2 (row per each user mentioned in a given status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2` as (
  SELECT DISTINCT status_id, user_id, UPPER(user_screen_name) as user_screen_name
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_mentions` sm
  --LIMIT 10
)

Status hashtags v2 (row per hastag per status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2` as (
  SELECT DISTINCT status_id, UPPER(tag) as tag
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_hashtags` sh
  --LIMIT 10
)

Status urls v2 (row per status per url, with domains parsed 😸 ):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2` as (

  SELECT status_id, url, 
      regexp_replace(
            split(split(url, "://")[OFFSET(1)], "/")[OFFSET(0)]
            , "www.", ""
      ) as url_domain
  FROM (
      SELECT DISTINCT status_id, url
      FROM `tweet-research-shared.f1_racing_2023.streaming_status_urls` su
      --LIMIT 10
  )
)

Status entities v2 (row per status per named entity) -- named entity recognition (todo: import the lookup tables):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2` as (
  -- HERE ARE THE CONTEXT DOMAIN AND ENTITY LOOKUP TABLES
  -- https://developer.twitter.com/en/docs/twitter-api/annotations/overview
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-domains-20220601.csv
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-entities-20220601.csv
  -- 
  SELECT DISTINCT status_id, domain_id, entity_id
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_entities`
  -- LIMIT 10
)

Status annotations v2 (row per status per annotation type and text):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2` as (
  -- annotations with given type (Other, Organization), and annotation text
  -- count and aggregate the non-dup probability scores
  SELECT status_id, type, text
      ,min(probability) as min_probability
      ,max(probability) as max_probability -- UPDATED TO MAX :-)
      ,avg(probability) as avg_probability
      ,count(*) as row_count
  FROM (
    -- remove dups first
    SELECT DISTINCT status_id, type, upper(text) as text, probability
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_annotations`
  )
  GROUP BY 1,2,3
  ORDER BY 7 DESC
)

s2t2 · 2023-09-23T19:17:10Z

Streaming media (row per unique "media key"):

-- SELECT media_key, count(*) as row_count
-- FROM `tweet-research-shared.f1_racing_2023.streaming_media`
-- GROUP BY 1 
-- ORDER BY 2 DESC
-- LIMIT 10
-- hmm there are a few dups like 13_1662438835244900352, 13_1655282075690323974, 13_1637199039660961793

--SELECT *
--FROM `tweet-research-shared.f1_racing_2023.streaming_media`
--WHERE media_key = '13_1662438835244900352' -- hmm this same media key has different preview image urls and widths?
--
--SELECT *
--FROM `tweet-research-shared.f1_racing_2023.streaming_media`
--WHERE media_key = '13_1655282075690323974' -- hmm this same media key has different URLs and widths?
--
--SELECT *
--FROM `tweet-research-shared.f1_racing_2023.streaming_media`
--WHERE media_key = '13_1637199039660961793' 
-- -- hmm this same media key has different preview image URLs : 
-- -- https://pbs.twimg.com/media/FriSDJAaMAARIdG.jpg, 
-- -- https://pbs.twimg.com/media/FriQWelaAAAwlWR.jpg
-- -- in some cases the preview images seem complimentary, so we can try to aggregate them

-- oh there are still some dups like: 13_1666420362333114370
-- SELECT *
-- FROM `tweet-research-shared.f1_racing_2023.streaming_media_v2`
-- WHERE media_key = '13_1666420362333114370' -- there are different widths let's not care about the heights and widths




DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_media_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_media_v2` as (
    SELECT media_key, type, url
        ,duration_ms 
        --,height, width ,alt_text
        , string_agg(preview_image_url, "|") as preview_image_urls
    FROM (
      -- remove dups
      SELECT DISTINCT media_key, type, url, preview_image_url, alt_text, duration_ms, height, width,
      FROM `tweet-research-shared.f1_racing_2023.streaming_media`
    )
    GROUP BY 1,2,3,4 --,5,6, 7

)


-- verify row per unique media key (no media keys should have more than one row)
SELECT media_key, count(*) as row_count
FROM `tweet-research-shared.f1_racing_2023.streaming_media_v2`
GROUP BY 1 
ORDER BY 2 DESC
LIMIT 10

s2t2 · 2023-09-23T19:24:40Z

NOTE: I believe these tables represent tags and mentions in user profiles. So we store earliest and latest date seen in the profile. Counting number of times accessed is not that important.

Streaming user hashtags (row per user per tag):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_hashtags_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_hashtags_v2` as (
  SELECT user_id, UPPER(tag) as tag
    ,min(date(accessed_at)) as earliest_access
    ,max(date(accessed_at)) as latest_access
  FROM `tweet-research-shared.f1_racing_2023.streaming_user_hashtags`
  GROUP BY 1, 2
  -- LIMIT 10
)

User hashtags (row per user per tag):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_mentions_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_user_mentions_v2` as (
    SELECT user_id, UPPER(mention_screen_name) as mention_screen_name
      ,min(date(accessed_at)) as earliest_access
      ,max(date(accessed_at)) as latest_access
    FROM `tweet-research-shared.f1_racing_2023.streaming_user_mentions`
    GROUP BY 1,2
    -- LIMIT 10
)

s2t2 · 2023-09-23T19:59:32Z

News Quality

Lookup table:

CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.domain_fact_scores` as (
  -- GRABBING THE FACT SCORES LOOKUP TABLE PREVIOUSLY UPLOADED TO ANOTHER DATASET
  SELECT *
  FROM `tweet-research-shared.impeachment_2020.domain_fact_scores`
)

Assigning a fact score using the lookup table (FYI not all statuses share urls, and thus many will not have fact scores):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` as (
   SELECT
      urls.status_id
      ,urls.url
      ,urls.url_domain
      ,facts.domain as fact_domain
      ,facts.fact_score
      ,facts.fact_category
   FROM `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2` urls
   JOIN `tweet-research-shared.f1_racing_2023.domain_fact_scores` facts ON facts.domain = urls.url_domain
   --ORDER BY fact_score DESC
   --LIMIT 10
)

Aggregated user fact scores:

-- calling this user_fact_scores instead of streaming_status_user_fact_scores 
-- (the streaming_user namespace is currently about user profiles)
DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.user_fact_scores`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.user_fact_scores` as (
    SELECT
        t.user_id
        ,count(distinct f.status_id) as status_count
        ,avg(f.fact_score) as avg_fact_score
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` f
    JOIN `tweet-research-shared.f1_racing_2023.streaming_tweets_v2` t on t.status_id = f.status_id
    GROUP BY 1
    --ORDER BY 2 DESC
    --LIMIT 50
);

Analysis (assess the percentage of tweets for which we have fact scores):

SELECT count(distinct t.status_id) as status_count
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2` t -- 9_676_610
--JOIN `tweet-research-shared.f1_racing_2023.streaming_status_urls` urls ON urls.status_id = t.status_id -- 4_948_138
--JOIN `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` f on f.status_id = t.status_id -- 977

SELECT count(distinct status_id) as status_count
FROM `tweet-research-shared.f1_racing_2023.streaming_status_fact_scores` -- 977 SO LOW!

s2t2 · 2023-09-24T04:38:58Z

Revisiting users_v2 migration:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` ;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_users_v2` AS (
  SELECT 
    u.user_id
    ,date(u.created_at) as created_on
    
      -- CURRENT / LATEST:

      ,latest_row.screen_name
      ,latest_row.name 
      ,latest_row.description
      ,latest_row.url
      ,latest_row.profile_image_url
      ,latest_row.verified
      ,latest_row.pinned_tweet_id
      ,latest_row.followers_count
      ,latest_row.following_count
      ,latest_row.tweet_count
      ,latest_row.listed_count

      -- AGGREGATES:

      ,min(date(u.accessed_at)) as first_access
      ,max(date(u.accessed_at)) as latest_access

      ,count(distinct upper(u.screen_name)) as screen_name_count
      ,array_agg(distinct upper(u.screen_name) IGNORE NULLS) as screen_names

      ,count(distinct upper(u.name)) as names_count
      ,array_agg(distinct UPPER(u.name) IGNORE NULLS) as names 

      ,count(distinct u.description) as descriptions_count
      ,string_agg(distinct u.description, " | ") as descriptions
    
      ,count(distinct u.url) as urls_count
      --,array_agg(DISTINCT url IGNORE NULLS) as urls

      ,count(distinct u.profile_image_url) as profile_image_urls_count
      --,array_agg(profile_image_url IGNORE NULLS) as profile_image_urls

  FROM `tweet-research-shared.f1_racing_2023.streaming_users` u
  JOIN (
    -- FOR EACH USER, GET THE LATEST VALUES
    SELECT latest.user_id, latest.last_accessed_at
      ,upper(u.screen_name) as screen_name
      ,u.name as name
      ,u.description
      ,u.url
      ,u.profile_image_url
      ,u.verified
      ,u.pinned_tweet_id
      ,u.followers_count
      ,u.following_count
      ,u.tweet_count
      ,u.listed_count
    FROM`tweet-research-shared.f1_racing_2023.streaming_users` u 
    JOIN (
      -- FOR EACH USER, WHAT IS THE TIMESTAMP OF THE LATEST ROW?
      SELECT user_id, max(accessed_at) as last_accessed_at
      FROM `tweet-research-shared.f1_racing_2023.streaming_users` -- 1384754
      GROUP BY 1
    ) latest on latest.last_accessed_at = u.accessed_at and latest.user_id = u.user_id  -- assumes no dups on these composite key fields
    --LIMIT 10
  ) latest_row on latest_row.user_id = u.user_id
  GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13
)

what a beast

s2t2 · 2023-09-26T20:25:53Z

Twitter Named Entity Recognition

Twitter Context Entities and their corresponding Domains:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.context_entities`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.context_entities` as (
  SELECT * 
  FROM `tweet-collector-py.jan6_committee_development.entities` 
  -- where I had previously uploaded this CSV file from twitterdev:
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-entities-20220601.csv
);

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.context_domains`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.context_domains` as (
  SELECT * 
  FROM `tweet-collector-py.jan6_committee_development.domains` 
  -- where I had previously uploaded this CSV file from twitterdev:
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-domains-20220601.csv
);

Joining and cleaning:

--SELECT e.entity_id, e.entity_name, e.domain_ids
-- FROM `tweet-research-shared.f1_racing_2023.context_entities` e -- 144,753
-- some entities have multiple domains. for example:
-- 1139229372198469633 Dogecoin cryptocurrency | domains: 131, 174

-- SELECT count(distinct entity_id) as entity_count 
-- FROM `tweet-research-shared.f1_racing_2023.context_entities` -- 144,731

-- SELECT DISTINCT entity_id,entity_name, domain_id
-- FROM `tweet-research-shared.f1_racing_2023.context_entities`,
-- UNNEST(domain_ids) as domain_id
-- WHERE entity_id = 1139229372198469633 -- 2 rows, great

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.context_entities_v2`;
CREATE TABLE `tweet-research-shared.f1_racing_2023.context_entities_v2` as (
    SELECT e.entity_id, e.entity_name, e.domain_id, d.domain_name
    FROM (
        SELECT DISTINCT entity_id,entity_name, domain_id
        FROM `tweet-research-shared.f1_racing_2023.context_entities`,
        UNNEST(domain_ids) as domain_id
        --WHERE entity_id = 1139229372198469633 -- 2 rows, great
    ) e
    LEFT JOIN `tweet-research-shared.f1_racing_2023.context_domains` d ON d.domain_id = e.domain_id
);

Analysis:

SELECT domain_id, domain_name, count(distinct entity_id) as entity_count
FROM `tweet-research-shared.f1_racing_2023.context_entities_v2`
-- WHERE domain_name is null -- 0 awesome
GROUP BY 1,2
ORDER BY 3 DESC
-- 31 Community: 71215
-- 131 Unified Twitter Taxonomy: 47499
-- 10 Person:  43188
-- 60 Athlete: 17960

JiazanShi · 2023-09-27T19:20:38Z

Status mentions v2 (row per each user mentioned in a given status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_mentions_v2` as (
  SELECT DISTINCT status_id, user_id, UPPER(user_screen_name) as user_screen_name
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_mentions` sm
  --LIMIT 10
)

Status hashtags v2 (row per hastag per status):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_hashtags_v2` as (
  SELECT DISTINCT status_id, UPPER(tag) as tag
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_hashtags` sh
  --LIMIT 10
)

Status urls v2 (row per status per url, with domains parsed 😸 ):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_urls_v2` as (

  SELECT status_id, url, 
      regexp_replace(
            split(split(url, "://")[OFFSET(1)], "/")[OFFSET(0)]
            , "www.", ""
      ) as url_domain
  FROM (
      SELECT DISTINCT status_id, url
      FROM `tweet-research-shared.f1_racing_2023.streaming_status_urls` su
      --LIMIT 10
  )
)

Status entities v2 (row per status per named entity) -- named entity recognition (todo: import the lookup tables):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_entities_v2` as (
  -- HERE ARE THE CONTEXT DOMAIN AND ENTITY LOOKUP TABLES
  -- https://developer.twitter.com/en/docs/twitter-api/annotations/overview
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-domains-20220601.csv
  -- https://github.com/s2t2/twitter-context-annotations/blob/main/files/evergreen-context-entities-20220601.csv
  -- 
  SELECT DISTINCT status_id, domain_id, entity_id
  FROM `tweet-research-shared.f1_racing_2023.streaming_status_entities`
  -- LIMIT 10
)

Status annotations v2 (row per status per annotation type and text):

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2` as (
  -- annotations with given type (Other, Organization), and annotation text
  -- count and aggregate the non-dup probability scores
  SELECT status_id, type, text
      ,min(probability) as min_probability
      ,min(probability) as max_probability
      ,avg(probability) as avg_probability
      ,count(*) as row_count
  FROM (
    -- remove dups first
    SELECT DISTINCT status_id, type, upper(text) as text, probability
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_annotations`
  )
  GROUP BY 1,2,3
  ORDER BY 7 DESC
)

update migration query for status annotation_v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_status_annotations_v2` as (
  -- annotations with given type (Other, Organization), and annotation text
  -- count and aggregate the non-dup probability scores
  SELECT status_id, type, text
      ,MIN(probability) as min_probability
      ,MAX(probability) as max_probability
      ,AVG(probability) as avg_probability
      ,COUNT(*) as row_count
  FROM (
    -- remove dups first
    SELECT DISTINCT status_id, type, upper(text) as text, probability
    FROM `tweet-research-shared.f1_racing_2023.streaming_status_annotations`
  )
  GROUP BY 1,2,3
  ORDER BY 7 DESC
)

s2t2 · 2023-09-29T18:01:43Z

Streaming Rules

Matching rule names with their identifiers, and bumping the rules table:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_rules_v2`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_rules_v2` as (
    SELECT 
        CASE -- we manually matched the names to ids after inspecting tweets matching the given rule id
            WHEN rule = '#F1' THEN 1634579704828117005
            WHEN rule = '@F1' THEN 1634579704828117001
            WHEN rule = '@MercedesAMGF1' THEN 1634579704828117008
            WHEN rule = '@redbullracing' THEN 1634579704828116999
            WHEN rule = '@ScuderiaFerrari' THEN 1634579704828117004
            WHEN rule = '@AstonMartinF1' THEN 1634579704828116995
            WHEN rule = '#Formula1' THEN 1634579704828116993
            WHEN rule = '@ESPNF1' THEN 1634579704828117002
            WHEN rule = '@McLarenF1' THEN 1634579704828117003
            WHEN rule = '@SkySportsF1' THEN 1634579704828116992
            WHEN rule = '@AlpineF1Team' THEN 1634579704828116998
            WHEN rule = '@WilliamsRacing' THEN 1634579704828117006
            WHEN rule = '@AlphaTauriF1' THEN 1634579704828116994
            WHEN rule = '@HaasF1Team' THEN 1634579704828117007
            WHEN rule = '@AlfaRomeoF1' THEN 1634579704828117000
        END rule_id
        ,rule as rule_name
        , created_at
    FROM `tweet-research-shared.f1_racing_2023.streaming_rules`
);

JiazanShi · 2023-10-09T19:54:30Z

de-dup query for tweets_v2:

SELECT COUNT(*)
FROM(
SELECT DISTINCT status_id,
    ANY_VALUE(status_text) AS status_text,
    ANY_VALUE(created_at) AS created_at,
    ANY_VALUE(user_id) AS user_id,
    ANY_VALUE(retweet_status_id) AS retweet_status_id,
    ANY_VALUE(reply_status_id) AS reply_status_id,
    ANY_VALUE(quote_status_id) AS quote_status_id,
    ANY_VALUE(conversation_id) AS conversation_id,
    ANY_VALUE(langs) AS langs
  FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`
  GROUP BY 1
)
-- 9676610

SELECT COUNT(DISTINCT status_id)
 FROM `tweet-research-shared.f1_racing_2023.streaming_tweets` 
 -- 9676610 
-- we have 9676610 unique id from the original dataset and now the number of unique id in tweets_2 should be identical

JiazanShi · 2023-10-11T17:11:16Z

tweets dataset update to 'tweets_v3' to further de dups from v2:

DROP TABLE IF EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v3`;
CREATE TABLE IF NOT EXISTS `tweet-research-shared.f1_racing_2023.streaming_tweets_v3` AS (
  SELECT DISTINCT status_id,
    ANY_VALUE(status_text) AS status_text,
    ANY_VALUE(created_at) AS created_at,
    ANY_VALUE(user_id) AS user_id,
    ANY_VALUE(retweet_status_id) AS retweet_status_id,
    ANY_VALUE(reply_status_id) AS reply_status_id,
    ANY_VALUE(quote_status_id) AS quote_status_id,
    ANY_VALUE(conversation_id) AS conversation_id,
    ANY_VALUE(langs) AS langs
  FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v2`
  GROUP BY 1
);

-- check
SELECT COUNT(*), COUNT(DISTINCT status_id)
FROM `tweet-research-shared.f1_racing_2023.streaming_tweets_v3`
-- 9676610, 9676610

s2t2 changed the title ~~Queries~~ Queries and Migrations Sep 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Queries and Migrations #4

Queries and Migrations #4

s2t2 commented Sep 18, 2023 •

edited

Loading

s2t2 commented Sep 18, 2023 •

edited

Loading

s2t2 commented Sep 18, 2023

s2t2 commented Sep 18, 2023 •

edited

Loading

JiazanShi commented Sep 19, 2023

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 24, 2023 •

edited

Loading

s2t2 commented Sep 26, 2023 •

edited

Loading

JiazanShi commented Sep 27, 2023

s2t2 commented Sep 29, 2023

JiazanShi commented Oct 9, 2023

JiazanShi commented Oct 11, 2023

Queries and Migrations #4

Queries and Migrations #4

Comments

s2t2 commented Sep 18, 2023 • edited Loading

s2t2 commented Sep 18, 2023 • edited Loading

De-duplicating

s2t2 commented Sep 18, 2023

s2t2 commented Sep 18, 2023 • edited Loading

Status Rules

JiazanShi commented Sep 19, 2023

s2t2 commented Sep 23, 2023 • edited Loading

s2t2 commented Sep 23, 2023 • edited Loading

s2t2 commented Sep 23, 2023 • edited Loading

s2t2 commented Sep 23, 2023 • edited Loading

News Quality

s2t2 commented Sep 24, 2023 • edited Loading

s2t2 commented Sep 26, 2023 • edited Loading

Twitter Named Entity Recognition

JiazanShi commented Sep 27, 2023

s2t2 commented Sep 29, 2023

Streaming Rules

JiazanShi commented Oct 9, 2023

JiazanShi commented Oct 11, 2023

s2t2 commented Sep 18, 2023 •

edited

Loading

s2t2 commented Sep 18, 2023 •

edited

Loading

s2t2 commented Sep 18, 2023 •

edited

Loading

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 23, 2023 •

edited

Loading

s2t2 commented Sep 24, 2023 •

edited

Loading

s2t2 commented Sep 26, 2023 •

edited

Loading