Twarc Tutorial #558

Merged
merged 22 commits into from
Dec 20, 2022
029f89c
Rework documentation and structure a bit to include a new tutorial se…
SamHames Oct 20, 2021
ecf47cd
Propose a skeleton structure for the tutorial
SamHames Oct 20, 2021
2200954
Cleanup the tutorial a little
SamHames Nov 29, 2021
83f62b3
First draft of the 'before we begin' section
SamHames Jan 17, 2022
2c37abb
More cleanup and moar words for the tutorial
SamHames Jan 24, 2022
2a9ae9b
Flesh out more of the actual search content
SamHames Feb 3, 2022
a617115
Update the data processing to avoid working with excel at all
SamHames Feb 8, 2022
65c5a6a
Incorporate more of the counts/search workflow as an interative approach
SamHames Feb 8, 2022
28cf9d6
Flesh out the search/counts worked example, cleanup some todos
SamHames Feb 9, 2022
17ae42d
Rearrange sections based on the suggestion [ci skip]
SamHames Feb 11, 2022
d0aaa0d
Make the research question more concrete
SamHames Feb 11, 2022
870f3f3
Link to existing resources section of the twarc docs [ci skip]
SamHames Feb 11, 2022
b566fc6
Add module installation to documentation local development instructions
betsybookwyrm May 23, 2022
0a4a0d9
Tutorial: Fill out API explainer section
betsybookwyrm May 23, 2022
ec3a257
Flesh out Intro to Twitter API tutorial section
betsybookwyrm Jul 15, 2022
cce8488
Minor proofreading fixes for tutorial
betsybookwyrm Oct 10, 2022
7cf5943
Require mkdocstrings[python] for docs building
betsybookwyrm Oct 10, 2022
52289d2
Explain API versions
betsybookwyrm Oct 10, 2022
e360879
Merge branch 'tutorial' of github.com:boyd-nguyen/twarc into tutorial
boyd-nguyen Nov 11, 2022
11f4dc8
added screenshots + editted code blocks + fixed minor typos
boyd-nguyen Nov 11, 2022
2432ec3
Merge pull request #665 from boyd-nguyen/tutorial
SamHames Nov 11, 2022
ece130f
Apply suggestions from code review
igorbrigadir Dec 15, 2022
Tutorial: Fill out API explainer section
Adds a brief overview of what an API is to the tutorial.
betsybookwyrm committed Oct 10, 2022
commit 0a4a0d93bcb8f77825d71f7a6710e8a2a2daacc4
38 changes: 33 additions & 5 deletions docs/tutorial.md
@@ -26,7 +26,35 @@ We'll answer this question with a simple quantitative approach to analysing the

### What is an API?

Brief explanation of an API, especially a web API. Also need to include a link to a primer somewhere else.
An Application Programming Interface (API) is a common way for software applications and services
to let other systems or people interact with them programmatically. For example,
Twitter has an API which allows external systems to make requests to Twitter for information or
actions. Twitter (like many other web apps and services) uses an HTTP REST API: to interact
with Twitter through the API, you send an HTTP request to a specific URL (also known as an endpoint) provided by Twitter, and
Twitter responds with a bundle of information in JSON format.
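
As a rough illustration of what "an HTTP request to an endpoint" looks like, the sketch below builds (but does not send) a request to the v2 user lookup URL and parses a JSON body. The bearer token is a placeholder, and the sample response and its user ID are made up for this example rather than captured from the API.

```python
import json
import urllib.request

# Assumption: placeholder credential; a real token comes from the Twitter developer portal.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"


def build_user_lookup_request(username: str) -> urllib.request.Request:
    """Build, without sending, a GET request for the v2 user lookup endpoint."""
    url = f"https://api.twitter.com/2/users/by/username/{username}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        method="GET",
    )


# Hand-written stand-in for a JSON response; the ID "12345" is invented.
SAMPLE_RESPONSE = '{"data": {"id": "12345", "name": "Twitter", "username": "Twitter"}}'


def extract_user_id(response_body: str) -> str:
    """Pull the user ID out of a JSON response body shaped like SAMPLE_RESPONSE."""
    return json.loads(response_body)["data"]["id"]


request = build_user_lookup_request("Twitter")
print(request.get_method(), request.full_url)
print(extract_user_id(SAMPLE_RESPONSE))
```

Sending the request (for example with `urllib.request.urlopen`) is deliberately left out, since it needs valid credentials and network access.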

Twarc acts as an intermediary so that you don't have to manage the details
of how exactly to make requests to the Twitter API and handle Twitter's responses. Twarc commands
correspond roughly to Twitter API endpoints. For example, when you use Twarc to fetch the timeline of a specific
Twitter account (we'll use @Twitter in this example), this is the sequence of events:

1. You run `twarc2 timeline Twitter tweets.jsonl`
2. twarc2 makes a request on your behalf to the [Twitter v2 user lookup API endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/introduction)
in order to find the user ID for the @Twitter account, and receives a response from the Twitter API server with that user ID
3. twarc2 makes a request on your behalf to the [Twitter v2 timeline API endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/introduction),
using the user ID determined in step 2, and receives a response (or several responses) from the Twitter API server with @Twitter's tweets
4. twarc2 consolidates the timeline responses from step 3 and outputs them according to your initial command, in this case as `tweets.jsonl`
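
The sequence above can be sketched in Python with stubbed-out API calls. This is not twarc's implementation: the function names, the canned responses, and the choice to write one JSON line per response page are assumptions made for illustration. The `next_token` field mirrors how the v2 timeline endpoint signals that more pages are available.

```python
import io
import json

# Canned pages standing in for the Twitter API server (all values invented).
FAKE_PAGES = {
    None: {
        "data": [{"id": "3", "text": "third"}, {"id": "2", "text": "second"}],
        "meta": {"next_token": "page2"},
    },
    "page2": {"data": [{"id": "1", "text": "first"}], "meta": {}},
}


def lookup_user_id(username):
    """Step 2: resolve a username to a user ID (stubbed, hypothetical ID)."""
    return {"Twitter": "12345"}[username]


def fetch_timeline_page(user_id, pagination_token=None):
    """Step 3: fetch one page of the user's timeline (stubbed)."""
    return FAKE_PAGES[pagination_token]


def timeline(username, outfile):
    """Steps 2 to 4: look up the user, page through the timeline, write JSONL."""
    user_id = lookup_user_id(username)
    token = None
    pages = 0
    while True:
        page = fetch_timeline_page(user_id, token)
        outfile.write(json.dumps(page) + "\n")  # one response per line
        pages += 1
        token = page["meta"].get("next_token")
        if token is None:  # no further pages to request
            break
    return pages


output = io.StringIO()  # stands in for tweets.jsonl
page_count = timeline("Twitter", output)
lines = output.getvalue().splitlines()
print(page_count, "pages written")
```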

There are many resources on the internet for learning more about APIs in general and how to use them in a
variety of contexts. Here are a few introductory articles:

- [How-To Geek: What is an API, and how do developers use them?](https://www.howtogeek.com/343877/what-is-an-api/)
- [IBM: What is an API?](https://www.ibm.com/cloud/learn/api)

More detailed information on APIs and working with them:

- [Zapier: An introduction to APIs](https://zapier.com/learn/apis/)
- [RealPython: Python and REST APIs: Interacting with web services](https://realpython.com/api-integration-in-python/)

### What can you do with the Twitter API?

@@ -198,7 +226,7 @@ Let's improve this by updating our command to:

And we should see output like below. Note that `--text` and `--granularity` are optional flags for the `twarc2 counts` command; we can see the other options by running `twarc2 counts --help`. In this case `--text` returns a simplified text output for easier reading, and `--granularity day` is passed to the Twitter API to specify that we're interested only in daily counts of tweets, not the default hourly counts.

<table of results>
<table of results />

Note that this is only the count for the last seven days, which is the level of search functionality available to all developers via the standard track of the Twitter API. If you have access to the [Twitter Academic track](https://developer.twitter.com/en/use-cases/do-research/academic-research), you can switch to searching the full Twitter archive from the `counts` and `search` commands by adding the `--archive` flag.
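
To make the difference between hourly and daily granularity concrete, here is a minimal sketch that rolls hourly buckets up into daily totals. The bucket fields (`start`, `end`, `tweet_count`) follow the shape of the v2 counts response, but the numbers are invented, and grouping by the UTC date prefix of `start` is a simplifying assumption.

```python
from collections import defaultdict

# Invented hourly buckets, shaped like entries in a counts response.
HOURLY_COUNTS = [
    {"start": "2022-02-01T22:00:00.000Z", "end": "2022-02-01T23:00:00.000Z", "tweet_count": 4},
    {"start": "2022-02-01T23:00:00.000Z", "end": "2022-02-02T00:00:00.000Z", "tweet_count": 6},
    {"start": "2022-02-02T00:00:00.000Z", "end": "2022-02-02T01:00:00.000Z", "tweet_count": 5},
]


def daily_totals(buckets):
    """Sum tweet counts per UTC day, keyed by the YYYY-MM-DD prefix of `start`."""
    totals = defaultdict(int)
    for bucket in buckets:
        day = bucket["start"][:10]
        totals[day] += bucket["tweet_count"]
    return dict(totals)


totals = daily_totals(HOURLY_COUNTS)
print(totals)
```

In practice, passing `--granularity day` asks the API to do this aggregation for you, so a roll-up like this is only needed when working from hourly data you have already collected.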

@@ -208,21 +236,21 @@ Let's work through this example a little further, first we want to expand to cap

`twarc2 counts "echidna echidna's echidnas" --granularity day --text`

<table of results>
<table of results />

Suddenly we're retrieving very few results! By default, if you don't specify an operator, the Twitter API assumes you mean AND, that is, that all of the words should be present. We need to explicitly say that we want any of these words using the OR operator:

`twarc2 counts "echidna OR echidna's OR echidnas" --granularity day --text`

<table of results>
<table of results />
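
The difference between the two queries above can be illustrated with a toy matcher. This is a simplified sketch, not how Twitter's search works: it assumes whole-word, case-insensitive matching, and the example tweets are invented.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matches(query, text):
    """Space-separated terms must all be present; OR separates alternatives."""
    words = tokenize(text)
    alternatives = query.split(" OR ")
    return any(
        all(term.lower() in words for term in alternative.split())
        for alternative in alternatives
    )


# Invented example tweets.
tweets = [
    "I saw an echidna today",
    "Echidnas are cool",
    "The echidna's spines",
]

and_query = "echidna echidna's echidnas"
or_query = "echidna OR echidna's OR echidnas"

and_hits = [t for t in tweets if matches(and_query, t)]
or_hits = [t for t in tweets if matches(or_query, t)]
print(len(and_hits), "tweets match the implicit AND query")
print(len(or_hits), "tweets match the OR query")
```

No single tweet here contains all three spellings, so the implicit AND query matches nothing, while the OR query matches every tweet.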

We can also apply operators based on other content or properties of tweets (see more [search operators](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list) in the Twitter API documentation). Because we're deciding to focus on the number of likes on tweets as our measure of coolness, we want to exclude retweets. If we don't exclude retweets, our like measure might be heavily influenced by one highly retweeted tweet.

We can do this using the `-` (minus) operator, which allows us to exclude tweets matching a criterion, in conjunction with the `is:retweet` operator, which filters on whether the tweet is a retweet or not. If we applied just the `is:retweet` operator we'd only see the retweets, the opposite of what we want.

`twarc2 counts "echidna OR echidna's OR echidnas -is:retweet" --granularity day --text`

<table of results>
<table of results />

There's one tiny gotcha from the Twitter API here, which is important to know about. AND operators are applied before OR operators, even if the AND is not specified by the user. The query we wrote above actually means something like the query below. We're only removing the retweets containing the word "echidnas", not all retweets: