This repository has been archived by the owner on Nov 11, 2023. It is now read-only.

initial commit
jdevoo committed Mar 6, 2019
0 parents commit 3f8bc3a
Showing 17 changed files with 1,545 additions and 0 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
nucoll
target/
/*.dat
/*.qry
/*.gml
fdat/
img/
7 changes: 7 additions & 0 deletions LICENSE
@@ -0,0 +1,7 @@
Copyright 2019 Jean-Paul de Vooght

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
56 changes: 56 additions & 0 deletions Makefile
@@ -0,0 +1,56 @@
BINARY = nucoll
NEWTAG := $(shell git describe --abbrev=0 --tags)
OLDTAG := $(shell git describe --abbrev=0 --tags `git rev-list --tags --skip=1 --max-count=1`)

NIX_BINARIES = linux/amd64/$(BINARY) darwin/amd64/$(BINARY)
WIN_BINARIES = windows/amd64/$(BINARY).exe
COMPRESSED_BINARIES = $(NIX_BINARIES:%=%.bz2) $(WIN_BINARIES:%.exe=%.zip)
COMPRESSED_TARGETS = $(COMPRESSED_BINARIES:%=target/%)

temp = $(subst /, ,$@)
OS = $(word 2, $(temp))
ARCH = $(word 3, $(temp))
GITHASH = $(shell git log -1 --pretty=format:"%h")
GOVER = $(word 3, $(shell go version))
LDFLAGS = -ldflags '-X main.version=$(NEWTAG) -X main.githash=$(GITHASH) -X main.golang=$(GOVER)'

RELEASE_TOOL = github-release
USER = jdevoo

all: $(BINARY)

target/linux/amd64/$(BINARY):
	CGO_ENABLED=0 GOOS=$(OS) GOARCH=$(ARCH) go build $(LDFLAGS) -o "$@"
target/darwin/amd64/$(BINARY):
	CGO_ENABLED=0 GOOS=$(OS) GOARCH=$(ARCH) go build $(LDFLAGS) -o "$@"
target/windows/amd64/$(BINARY).exe:
	CGO_ENABLED=0 GOOS=$(OS) GOARCH=$(ARCH) go build $(LDFLAGS) -o "$@"

%.bz2: %
	tar -cjf "$@" -C $(dir $<) $(BINARY)

%.zip: %.exe
	zip -j "$@" "$<"

$(BINARY):
	go build $(LDFLAGS) -o $(BINARY)

install:
	go install $(LDFLAGS)
	$(BINARY) -v

# git tag v0.1
release:
	$(MAKE) $(COMPRESSED_TARGETS)
	git push && git push --tags
	git log --pretty=format:"%s" $(OLDTAG)...$(NEWTAG) | $(RELEASE_TOOL) release -u $(USER) -r $(BINARY) -t $(NEWTAG) -n $(NEWTAG) -d - || true
	$(foreach FILE, $(COMPRESSED_BINARIES), $(RELEASE_TOOL) upload -u $(USER) -r $(BINARY) -t $(NEWTAG) -n $(subst /,-,$(FILE)) -f target/$(FILE);)

clean:
	rm -f $(BINARY)
	rm -rf target

test:
	go test -v ./...

.PHONY: install release clean test
127 changes: 127 additions & 0 deletions README.md
@@ -0,0 +1,127 @@
Nucoll is a command-line tool written in Go. It retrieves data from Twitter and is based on its predecessor, twecoll. Each invocation starts with a keyword that instructs nucoll what to do. Below is a list of examples followed by a brief explanation of each command. The limits of the public Twitter API apply and are indicated below.

## Examples

#### Downloading Tweets
Nucoll can download up to about 3200 tweets for a Twitter handle. A handle is specified by a screen name or user ID. It can also retrieve tweets for a given search query.

```
$ nucoll tweets jdevoo
```

The previous example generates a `jdevoo.qry` file containing all retrieved tweets, including timestamp and text, in UTF-8 encoding. To search for tweets related to a certain hashtag or query expression, use the -q switch and put double quotes around the query string. Note that this can retrieve many more tweets and is potentially a lengthy operation.

```
$ nucoll tweets -q "#dg2g"
```

This also generates a `.qry` file. Its name is the URL-encoded search string, so it may look unusual.
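
As a rough illustration of what a URL-encoded file name looks like, the sketch below encodes a query with Go's standard library. The helper `queryFileName` is hypothetical and is not taken from nucoll's source; the exact encoding nucoll applies may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryFileName is a hypothetical helper (not part of nucoll): it
// URL-encodes a search query and appends the .qry extension.
func queryFileName(query string) string {
	return url.QueryEscape(query) + ".qry"
}

func main() {
	// The '#' character is percent-encoded as %23.
	fmt.Println(queryFileName("#dg2g")) // prints %23dg2g.qry
}
```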

#### Query File
A query file with extension `.qry` is just a text file that can also be created manually or produced by another tool. It contains handles which can be extracted by the init command. You could save a list of company handles to a file called `companies.qry` as in the example below.

```
@nike
@pfizer
@hugoboss
@ikea
@swatch
@pampers
@redbull
```
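
To give an idea of how handles could be pulled out of such a file, here is a minimal sketch. It assumes a handle is an `@` followed by up to 15 letters, digits or underscores; the function `extractHandles` and the pattern are illustrative and not nucoll's actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// handleRe is an assumed pattern: '@' followed by 1 to 15 word characters.
// It is a simplification and would also match the tail of an e-mail address.
var handleRe = regexp.MustCompile(`@(\w{1,15})`)

// extractHandles returns the distinct handles found in text,
// in order of first appearance and without the leading '@'.
func extractHandles(text string) []string {
	seen := make(map[string]bool)
	var handles []string
	for _, m := range handleRe.FindAllStringSubmatch(text, -1) {
		h := m[1]
		if !seen[h] {
			seen[h] = true
			handles = append(handles, h)
		}
	}
	return handles
}

func main() {
	content := "@nike\n@pfizer\n@hugoboss\n@nike\n"
	fmt.Println(extractHandles(content)) // prints [nike pfizer hugoboss]
}
```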

The -p switch of the tweets sub-command allows you to retrieve replies to a specified tweet ID. Note that this is limited in scope for the free API. For tweets that generate many comments, it will likely miss most of them.

#### Steps to Generating a Graph
One of the main uses of nucoll is to generate a GML file of second degree relationships. This is a three-step process that takes time due to API throttling by Twitter. In order to generate the graph, nucoll retrieves the handle's friends (or followers) and all friends-of-friends (2nd degree relationships). It then finds the relations between those, ignoring 2nd degree relationships to which the handle is not connected. In other words, it looks only for friend relationships among the friends/followers of the handle or query tweets initially supplied.
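
The filtering described above can be sketched as follows. This is an illustration under assumptions, not nucoll's code: the `edge` type, the `innerEdges` function and the in-memory map stand in for whatever nucoll reads from its data files.

```go
package main

import "fmt"

// edge is a directed "follows" relation between two handles.
type edge struct{ from, to string }

// innerEdges keeps only the friend relations whose two ends both belong
// to the first-degree set. Friends-of-friends outside that set are dropped.
func innerEdges(first []string, friendsOf map[string][]string) []edge {
	in := make(map[string]bool, len(first))
	for _, h := range first {
		in[h] = true
	}
	var edges []edge
	for _, h := range first {
		for _, f := range friendsOf[h] {
			if in[f] {
				edges = append(edges, edge{h, f})
			}
		}
	}
	return edges
}

func main() {
	first := []string{"a", "b", "c"} // friends of the initial handle
	friendsOf := map[string][]string{
		"a": {"b", "x"}, // x is not a first-degree friend and is ignored
		"b": {"c"},
		"c": {"a", "y"}, // y is ignored as well
	}
	for _, e := range innerEdges(first, friendsOf) {
		fmt.Printf("%s -> %s\n", e.from, e.to)
	}
}
```

Iterating over the slice of first-degree handles rather than over the map keeps the edge order deterministic.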

In this section, handles are derived from a given handle, but they could also be extracted from a query file. First, we retrieve friends (the default) for the specified handle.

```
$ nucoll init jdevoo
```

This generates a `jdevoo.dat` file. When passed the -i option, init populates an `img` directory with avatar images. It is also possible to initialize from a query file using the -q option. Next, nucoll retrieves the friends (always friends, never followers) of each entry in the `.dat` file.

```
$ nucoll fetch jdevoo
```

This populates the `fdat` directory with files per processed handle in `jdevoo.dat`.

The init sub-command also supports retrieving followers who retweet content from the provided handle, via its -r option. The option sets the maximum number of tweets to examine per follower. Note that this is a time-consuming operation because it scans the entire follower set.

After running fetch, you generate the graph file in the third and final step.

```
$ nucoll edgelist jdevoo
```

This generates a `jdevoo.gml` file in Graph Modelling Language (GML). You can use a package such as [Gephi](https://gephi.org/) to visualize your GML file. The GML file includes the friends, followers, memberships and statuses counts as properties of each handle. You could then derive additional metrics, e.g. the friends-to-followers or listed-to-followers ratios.
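
The ratios mentioned above are simple to compute once the counts are available. The sketch below uses a hypothetical `account` struct and made-up numbers; the property names in the real GML file may differ.

```go
package main

import "fmt"

// account holds per-handle counts. The field names are illustrative.
type account struct {
	Handle    string
	Friends   int
	Followers int
	Listed    int
}

// ratio divides num by den and returns 0 when den is 0,
// which avoids a division by zero for accounts without followers.
func ratio(num, den int) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}

func main() {
	a := account{Handle: "example", Friends: 50, Followers: 200, Listed: 10}
	fmt.Printf("%s friends/followers=%.2f listed/followers=%.2f\n",
		a.Handle, ratio(a.Friends, a.Followers), ratio(a.Listed, a.Followers))
}
```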

## Installation
Download the appropriate binary from the [releases](https://github.com/jdevoo/nucoll/releases) page.

On Windows, unzip the archive and place nucoll.exe on the path. On OS X and Linux, use tar e.g. `tar xf linux-amd64-nucoll.bz2` and place nucoll on the path.

If you have Go installed, you can execute `go get github.com/jdevoo/nucoll` to download, compile and install nucoll.

Then create a working directory to store the data from your experiments. Nucoll creates a number of files and folders to store its data.

* `fdat` directory containing friends of friends files
* `img` directory containing avatar images of friends
* `.dat` extension of account details data (friends, followers, avatar URL, etc. for account friends)
* `.qry` extension of tweets file (timestamp, tweet)
* `.gml` extension of edgelist file (nodes and edges)
* `.f` extension for friends data (fdat)

#### Registering Nucoll
The first time you run a nucoll command, it will ask you for the consumer key and consumer secret. Nucoll relies on oauth2 to authenticate with Twitter. You need to register your own copy of nucoll on first usage.

Twitter now makes you apply for a developer account and provide details about your intent which is just good data governance practice. This process takes some time as Twitter reviews your submission which includes an outline of your planned data experiment in 100 words. Once approved, create an application entry. Enabling sign-in with Twitter or URL callback are not required. I set this GitHub repo as website URL. Permissions are limited to read-only. If you try to run nucoll without being approved, you will get 401 errors on data from anyone but yourself. If you had twecoll previously registered and have been approved by Twitter for using its API, you can use the same Consumer API keys.
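
For readers curious about what happens with the consumer key and secret, the sketch below builds (but does not send) a token request following Twitter's application-only OAuth 2.0 flow. Whether nucoll uses exactly this flow is an assumption; `newTokenRequest` is a hypothetical helper and the key and secret are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// tokenURL is the application-only token endpoint of the Twitter API.
const tokenURL = "https://api.twitter.com/oauth2/token"

// newTokenRequest prepares a client-credentials request. The key and secret
// are URL-encoded and sent as HTTP Basic credentials; a successful response
// would carry a bearer token for subsequent read-only API calls.
func newTokenRequest(key, secret string) (*http.Request, error) {
	form := url.Values{"grant_type": {"client_credentials"}}
	req, err := http.NewRequest(http.MethodPost, tokenURL, strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(url.QueryEscape(key), url.QueryEscape(secret))
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
	return req, nil
}

func main() {
	// Placeholder credentials; nothing is sent over the network here.
	req, err := newTokenRequest("my-consumer-key", "my-consumer-secret")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```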

## Usage
Nucoll has built-in help and version switches invoked with -h and -v respectively. Each command can also be invoked with the help switch for additional information about its sub-options.

```
$ nucoll -h
usage: nucoll [-h] [-v]
{resolve,init,fetch,tweets,edgelist} ...
New Collection Tool
optional arguments:
-h show this help message and exit
-v show program's version number and exit
sub-commands:
{resolve,init,fetch,tweets,edgelist}
init retrieve friends data for screen_name
fetch retrieve friends of handles in .dat file
edgelist generate graph in GML format
tweets retrieve tweets
resolve retrieve user_id for screen_name or vice versa
```
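
Sub-commands with their own switches are commonly built in Go with one `flag.FlagSet` per command. The sketch below shows that pattern for a single `fetch` command, reusing the `-c` and `-f` switches from nucoll's help text. The function `parseFetch` is illustrative; nucoll's own code is organized differently.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseFetch parses the switches of a fetch-like sub-command and returns
// the force flag, the count limit and the remaining positional arguments.
func parseFetch(args []string) (bool, int, []string, error) {
	fs := flag.NewFlagSet("fetch", flag.ContinueOnError)
	force := fs.Bool("f", false, "ignore existing files")
	count := fs.Int("c", 5000, "skip if friends count above limit")
	if err := fs.Parse(args); err != nil {
		return false, 0, nil, err
	}
	return *force, *count, fs.Args(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: demo fetch [-c N] [-f] screen_name")
		return
	}
	// The first argument selects the sub-command; the rest goes to its FlagSet.
	switch os.Args[1] {
	case "fetch":
		force, count, rest, err := parseFetch(os.Args[2:])
		if err != nil {
			os.Exit(2)
		}
		fmt.Println(force, count, rest)
	default:
		fmt.Printf("%q is not a valid command\n", os.Args[1])
		os.Exit(1)
	}
}
```

Returning the parsed values instead of calling `os.Exit` inside the parser keeps the function easy to test.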

## Motivation
The predecessor of nucoll is twecoll, which was originally created as a submission for the final assignment in Lada Adamic's SNA MOOC on Coursera (now on [openmichigan](https://open.umich.edu/find/open-educational-resources/information/si-508-networks-theory-application)). Twecoll requires the Python 2.7 runtime, is tightly coupled to Twitter and includes an optional dependency on igraph, a third-party SNA library. Nucoll, by contrast, is a rewrite in Go and ships as executables for popular operating systems. Its structure is meant to support more than one social network and relies on external tools such as Gephi for network visualization and metrics. It's also fun to learn a new programming language :-)

## License
This project is licensed under the MIT License.

## Citation
If you use this tool to retrieve data for a paper, please consider citing it. Adapt the version tag and commit hash accordingly.

J.P. de Vooght, nucoll, version Y.Y XXXXXX, (2019), GitHub repository, https://github.com/jdevoo/nucoll

```
@misc{DeVooght2019,
author = {de Vooght, Jean-Paul},
title = {nucoll},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/jdevoo/nucoll}},
commit = {XXXXXX}
}
```
165 changes: 165 additions & 0 deletions cmd.go
@@ -0,0 +1,165 @@
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"

	"github.com/jdevoo/nucoll/twitter"
	"github.com/jdevoo/nucoll/util"
)

// SocialNetworkService defines the interface for services such as Twitter
type SocialNetworkService interface {
	Init(followersFlag bool, maxPostCount int, queryFlag bool, nomentionFlag bool, list string, imageFlag bool, args []string)
	Fetch(forceFlag bool, fetchCount int, args []string)
	Edgelist(egoFlag bool, missingFlag bool, args []string)
	Posts(queryFlag bool, list string, postID uint64, args []string)
	Resolve(args []string)
}

var (
	version      string // set at build time via -ldflags -X (see Makefile)
	golang       string // set at build time via -ldflags -X (see Makefile)
	githash      string // set at build time via -ldflags -X (see Makefile)
	initMembers  string
	maxPostCount int
	fetchCount   int
	postsList    string
	postsPostID  uint64

	helpFlag    = flag.Bool("h", false, "show this help message and exit")
	versionFlag = flag.Bool("v", false, "print version and exit")

	initCommand       = flag.NewFlagSet("init", flag.ExitOnError)
	initFollowersFlag = initCommand.Bool("o", false, "retrieve followers (default friends)")
	initQueryFlag     = initCommand.Bool("q", false, fmt.Sprintf("extract handles from %s file (default screen_name)", util.QueryExt))
	initNomentionFlag = initCommand.Bool("n", false, fmt.Sprintf("ignore mentions from %s file (default false)", util.QueryExt))
	initImageFlag     = initCommand.Bool("i", false, "download images (default false)")

	edgelistCommand     = flag.NewFlagSet("edgelist", flag.ExitOnError)
	edgelistEgoFlag     = edgelistCommand.Bool("e", false, "include screen_name (default false)")
	edgelistMissingFlag = edgelistCommand.Bool("m", false, "include missing handles (default false)")

	fetchCommand   = flag.NewFlagSet("fetch", flag.ExitOnError)
	fetchForceFlag = fetchCommand.Bool("f", false, fmt.Sprintf("ignore existing %s files (default false)", util.FdatExt))

	resolveCommand = flag.NewFlagSet("resolve", flag.ExitOnError)

	postsCommand   = flag.NewFlagSet("tweets", flag.ExitOnError)
	postsQueryFlag = postsCommand.Bool("q", false, "argument is a quoted query string (default screen_name)")

	// Usage overrides PrintDefaults
	Usage = func() {
		fmt.Println("Usage: " + filepath.Base(os.Args[0]) + " [-h] [-v]")
		fmt.Println(" {init,fetch,edgelist,tweets,resolve} ...")
		fmt.Println()
		fmt.Println("New Collection Tool")
		fmt.Println()
		fmt.Println("Sub-commands:")
		fmt.Println(" init retrieve friends data for screen_name")
		fmt.Printf(" fetch retrieve friends of handles in %s file\n", util.DatExt)
		fmt.Println(" edgelist generate graph in GML format")
		fmt.Println(" tweets retrieve tweets")
		fmt.Println(" resolve retrieve user_id for screen_name or vice versa")
		fmt.Println()
		fmt.Println("Optional arguments:")
		flag.PrintDefaults()
	}
)

func init() {
	initCommand.StringVar(&initMembers, "m", "", "extract member handles from list owned by screen_name")
	initCommand.IntVar(&maxPostCount, "r", 0, "tweet count limit when looking for retweets by followers")
	initCommand.Usage = func() {
		fmt.Println("Usage: " + filepath.Base(os.Args[0]) + " init [-h] [-i] [-m list] [-n] [-o] [-q] [-r N] screen_name")
		initCommand.PrintDefaults()
	}
	fetchCommand.IntVar(&fetchCount, "c", 5000, "skip if friends count above limit")
	fetchCommand.Usage = func() {
		fmt.Println("Usage: " + filepath.Base(os.Args[0]) + " fetch [-h] [-c N] [-f] screen_name")
		fetchCommand.PrintDefaults()
	}
	edgelistCommand.Usage = func() {
		fmt.Println("Usage: " + filepath.Base(os.Args[0]) + " edgelist [-h] [-e] [-m] screen_name [screen_name...]")
		edgelistCommand.PrintDefaults()
	}
	resolveCommand.Usage = func() {
		fmt.Println("Usage: " + filepath.Base(os.Args[0]) + " resolve [-h] screen_name [screen_name...]")
	}
	postsCommand.StringVar(&postsList, "m", "", "extract tweets from list")
	postsCommand.Uint64Var(&postsPostID, "p", 0, "replies to tweet id by screen_name")
	postsCommand.Usage = func() {
		fmt.Println("Usage: " + filepath.Base(os.Args[0]) + " tweets [-h] [-p id] [-m list] [-q] <screen_name | \"query\">")
		postsCommand.PrintDefaults()
	}
}

func main() {
	var sns SocialNetworkService

	flag.Parse()
	if *versionFlag {
		fmt.Printf("New Collection Tool %s (%s %s)\n", version, golang, githash)
		os.Exit(0)
	}
	if len(flag.Args()) == 0 || *helpFlag {
		Usage()
		os.Exit(1)
	}

	sns = twitter.Twitter{}

	switch os.Args[1+flag.NFlag()] {
	case "init":
		if err := initCommand.Parse(os.Args[2:]); err == nil {
			if initCommand.NArg() == 1 {
				sns.Init(*initFollowersFlag, maxPostCount, *initQueryFlag, *initNomentionFlag, initMembers, *initImageFlag, initCommand.Args())
			} else {
				initCommand.Usage()
				os.Exit(1)
			}
		}
	case "edgelist":
		if err := edgelistCommand.Parse(os.Args[2:]); err == nil {
			if edgelistCommand.NArg() > 0 {
				sns.Edgelist(*edgelistEgoFlag, *edgelistMissingFlag, edgelistCommand.Args())
			} else {
				edgelistCommand.Usage()
				os.Exit(1)
			}
		}
	case "fetch":
		if err := fetchCommand.Parse(os.Args[2:]); err == nil {
			if fetchCommand.NArg() == 1 {
				sns.Fetch(*fetchForceFlag, fetchCount, fetchCommand.Args())
			} else {
				fetchCommand.Usage()
				os.Exit(1)
			}
		}
	case "resolve":
		if err := resolveCommand.Parse(os.Args[2:]); err == nil {
			if resolveCommand.NArg() > 0 {
				sns.Resolve(resolveCommand.Args())
			} else {
				resolveCommand.Usage()
				os.Exit(1)
			}
		}
	case "tweets":
		if err := postsCommand.Parse(os.Args[2:]); err == nil {
			if postsCommand.NArg() > 0 {
				sns.Posts(*postsQueryFlag, postsList, postsPostID, postsCommand.Args())
			} else {
				postsCommand.Usage()
				os.Exit(1)
			}
		}
	default:
		fmt.Printf("%q is not a valid command\n", os.Args[1])
		os.Exit(1)
	}
	os.Exit(0)
}