Major refactoring of package structure #189

Merged
merged 8 commits into from
Apr 6, 2018


lintool
Member

@lintool lintool commented Apr 4, 2018

This is a major refactoring of package structure to simplify and rationalize the class and import structure.


GitHub issue(s):

If you are responding to an issue, please mention their numbers below.

What does this Pull Request do?

This big patch simplifies the package import structure, such that a script now looks like this:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val r = RecordLoader.loadArchives("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Note that the import statements make sense now: we import AUT, and also the UDFs in matchbox.
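
For readers without a Spark shell at hand, the shape of this pipeline can be approximated on plain Scala collections. This is only a sketch: `extractDomain` and `countItems` below are hypothetical stand-ins written for illustration, not the AUT implementations, and the descending sort is an assumption that is merely consistent with the sample output posted in this thread.

```scala
import java.net.URI

object LocalPipelineSketch {
  // Hypothetical stand-in for ExtractDomain: the host part of a URL,
  // or "" when the URL has no host or cannot be parsed.
  def extractDomain(url: String): String =
    try Option(new URI(url).getHost).getOrElse("")
    catch { case _: java.net.URISyntaxException => "" }

  // Hypothetical stand-in for countItems(): count occurrences of each item
  // and return (item, count) pairs, most frequent first (assumed ordering).
  def countItems[A](items: Seq[A]): Seq[(A, Int)] =
    items
      .groupBy(identity)
      .map { case (item, group) => (item, group.size) }
      .toSeq
      .sortBy { case (_, count) => -count }
}
```

Calling `LocalPipelineSketch.countItems(urls.map(LocalPipelineSketch.extractDomain)).take(10)` on a local `Seq[String]` of URLs mirrors the `map`, `countItems`, `take` chain above, minus the record loading and page filtering that only AUT provides.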

How should this be tested?

Our tests should be a start, but we won't know about any issues until we deploy with AUK.


codecov bot commented Apr 4, 2018

Codecov Report

Merging #189 into master will decrease coverage by 0.2%.
The diff coverage is 60%.


@@            Coverage Diff             @@
##           master     #189      +/-   ##
==========================================
- Coverage   67.44%   67.24%   -0.21%     
==========================================
  Files          33       32       -1     
  Lines         639      635       -4     
  Branches      125      124       -1     
==========================================
- Hits          431      427       -4     
  Misses        167      167              
  Partials       41       41
Impacted Files Coverage Δ
.../archivesunleashed/matchbox/RemoveHttpHeader.scala 100% <ø> (ø)
.../scala/io/archivesunleashed/matchbox/package.scala 100% <ø> (ø)
...ain/scala/io/archivesunleashed/ArchiveRecord.scala 80.55% <ø> (ø)
...io/archivesunleashed/matchbox/TupleFormatter.scala 57.14% <ø> (ø)
...ala/io/archivesunleashed/app/ExtractEntities.scala 0% <ø> (ø)
.../io/archivesunleashed/matchbox/ExtractDomain.scala 100% <ø> (ø)
...archivesunleashed/matchbox/ExtractAtMentions.scala 100% <ø> (ø)
...ain/scala/io/archivesunleashed/app/WriteGEXF.scala 100% <ø> (ø)
...chivesunleashed/matchbox/ExtractTextFromPDFs.scala 100% <ø> (ø)
.../archivesunleashed/data/ArchiveRecordWritable.java 68.08% <ø> (ø)
... and 20 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update beaa950...a46ce3b.

Member Author

lintool commented Apr 4, 2018

A bit more explanation of the revised package structure:

For all scripts, the two most important imports are:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

Think of the first as "base" AUT, and then all the UDFs in matchbox.

  • io.archivesunleashed.data holds all the Java classes: this has been consolidated.
  • io.archivesunleashed.matchbox holds all the UDFs.
  • io.archivesunleashed.app holds larger bundles of functionality that aren't UDFs (many of these were previously misplaced in matchbox). The contents of this package will need refactoring later, because they aren't actually "apps" yet.
  • io.archivesunleashed.util holds utilities and helpers. E.g., StringUtils has been moved here.
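
As an illustration of why a wildcard import from a utilities object can add methods to `String`, here is a minimal sketch using an implicit class. It assumes the helper works roughly this way; `StringUtilsSketch` and `UrlStringOps` are hypothetical names, this is not the actual AUT source, and the regular expression is simply the one used in the test script later in this thread.

```scala
// Hypothetical stand-in for a utilities object; not the AUT implementation.
object StringUtilsSketch {
  // Importing StringUtilsSketch._ brings this implicit class into scope,
  // so the method below becomes callable on any String.
  implicit class UrlStringOps(s: String) {
    // Strip a single leading "www." (optionally preceded by whitespace).
    def removePrefixWWW(): String = s.replaceAll("^\\s*www\\.", "")
  }
}
```

After `import StringUtilsSketch._`, an expression such as `"www.archive.org".removePrefixWWW()` compiles, because the compiler wraps the string in `UrlStringOps` implicitly.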

Contributor

greebie commented Apr 4, 2018

Okay, here is the revised test script I have used so far. It works with the example WARC, and I have some other WARCs I can test with locally as well.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.StringUtils._

import java.time.Instant

val warcPath = "../src/test/resources/warc/example.warc.gz"

def timed(f: => Unit) = {
  val start = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  println("Elapsed Time: " + (end - start))
}

timed {
  println("Get urls and count, taking 3.")
  val r = RecordLoader.loadArchives(warcPath, sc)
    .keepValidPages()
    .map(r => ExtractDomain(r.getUrl))
    .countItems()
  println(r.take(3).deep.mkString("\n"))
}


timed {
  println("Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.")
  val links = RecordLoader.loadArchives(warcPath, sc)
    .keepValidPages()
    .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
    .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
    .filter(r => r._1 != "" && r._2 != "")
    .countItems()
    .filter(r => r._2 > 5)
  println(links.take(3).deep.mkString("\n"))
}


timed {
  println("Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.")
  val crawlDateGroup = RecordLoader.loadArchives(warcPath, sc)
    .keepValidPages()
    .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5)
  println(crawlDateGroup.take(3).deep.mkString("\n"))
}


timed {
  println("Extract text, taking 3 examples.")
  val text = RecordLoader.loadArchives(warcPath, sc)
    .keepValidPages()
    .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  println(text.take(3).deep.mkString("\n"))
}

timed {
  println("Extract image urls, taking 3.")
  val images = RecordLoader.loadArchives(warcPath, sc)
    .keepValidPages()
    .flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
    .countItems()
  println(images.take(3).deep.mkString("\n"))
}

and the initial results (which seem in line with previous tests conducted at #121):

Get urls and count, taking 3.
(www.archive.org,132)
(deadlists.com,2)
(www.hideout.com.br,1)
Elapsed Time: 4076
Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.
((archive.org,archive.org),316)
((archive.org,wiki.etree.org),21)
((archive.org,creativecommons.org),12)
Elapsed Time: 2120
Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.
((20080430,archive.org,archive.org),316)
((20080430,archive.org,wiki.etree.org),21)
((20080430,archive.org,creativecommons.org),12)
Elapsed Time: 1557
Extract text, taking 3 examples.
(20080430,www.archive.org,http://www.archive.org/,HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:48:25 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT ETag: "47ac-16e-4f9e5b40" Accept-Ranges: bytes Content-Length: 366 Connection: close Content-Type: text/html; charset=UTF-8 Please visit our website at: http://www.archive.org)
(20080430,www.archive.org,http://www.archive.org/index.php,HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:48:25 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g X-Powered-By: PHP/5.0.5-2ubuntu1.4 Set-Cookie: PHPSESSID=657fa9749e9426f2ffa75f14b54ed4ac; path=/; domain=.archive.org Connection: close Content-Type: text/html; charset=UTF-8 Internet Archive Web | Moving Images | Texts | Audio | Software | Education | Patron Info | About IA Forums | FAQs | Contributions | Jobs | Donate Search: All Media Types   Wayback Machine   Moving Images     Animation & Cartoons     Arts & Music     Computers & Technology     Cultural & Academic Films     Ephemeral Films     Movies     News & Public Affairs     Non-English Videos     Open Source Movies     Prelinger Archives     Sports Videos     Video Games     Vlogs     Youth Media   Texts     American Libraries     Canadian Libraries     Open Source Books     Project Gutenberg     Biodiversity Heritage Library     Children's Library     Additional Collections   Audio     Audio Books & Poetry     Computers & Technology     Grateful Dead     Live Music Archive     Music & Arts     Netlabels     News & Public Affairs     Non-English Audio     Open Source Audio     Podcasts     Radio Programs     Spirituality & Religion   Software     CLASP     Tucows Software Library   Education Forums FAQs UploadAnonymous User (login or join us)     Announcements (more) Free Ultra High-Speed Internet to Public Housing Rise of the HighTech Non-Profits Zotero and Internet Archive join forces    Web 85 billion pages Advanced Search    Welcome to the Archive The Internet Archive is building a digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, we provide free access to researchers, historians, scholars, and the general public.       
Moving Images  115,646 movies Browse   (by keyword)    Live Music Archive  48,893 concerts Browse   (by band)    Audio  250,854 recordings Browse   (by keyword)    Texts  395,004 texts Browse   (by keyword)       Curator's Choice (more) A Few Good G-Men Randall Glass, the maker of "Warthog Jump," re-creates in "A Few Good G-Men" an entire scene from...    Curator's Choice (more) Grateful Dead Live at Nashville Municipal... Set 1 Sugaree Beat It On Down The Line Candyman Me And My Uncle -> Big River Stagger Lee Looks Like...    Curator's Choice (more) Zanstones - Slaakhuis: Live in Rotterdam, Holland Zanstones confuses the dutch masses with this live display of wacked rhythms, whacked vocals, and...    Curator's Choice (more) Secret armies; the new technique of Nazi warfare       Recent Reviews Code4Lib 2008: Can Resource Description become Rigorous Data? Average rating: Madonna adopts African baby. Average rating:    Recent Reviews Carolina Chocolate Drops Live at MerleFest on 2007-04-27 Average rating: Grateful Dead Live at Oakland-Alameda County Coliseum on 1988-12-28 Average rating:    Recent Reviews No Thoroughfare Average rating: JAHTARI RIDDIM FORCE - Farmer In The Sky / Depth Charge Average rating:    Recent Reviews A manual of chemical analysis, qualitative and quantitative Average rating: Chemical lecture experiments; non-metallic elements Average rating:       Most recent posts (write a post by going to a forum) more... Subject Poster Forum Replies Views Date Re: Making a mix for a chick I know... William Tell GratefulDead 0 6 20 minutes ago Re: Bob's shorts not going into archives BobsShortShorts GratefulDead 0 9 26 minutes ago Re: Thanks to All airgarcia416 GratefulDead 0 5 26 minutes ago Re: Bob's shorts not going into archives sydthecat2 GratefulDead 0 8 36 minutes ago Re: What is the worst-reviewed feature film on IA? 
RipJarvis feature_films 0 9 50 minutes ago Re: Playin' In The Band...all day and all night sydthecat2 GratefulDead 0 11 58 minutes ago Re: Playin' In The Band...all day and all night rastamon GratefulDead 0 16 1 hour ago Re: Making a mix for a chick I know... caspersvapors GratefulDead 1 11 1 hour ago Re: Bob's shorts not going into archives rastamon GratefulDead 0 11 1 hour ago Re: Bob's shorts not going into archives bluedevil GratefulDead 1 13 1 hour ago      Institutional Support Alexa Internet HP Computer The Kahle/Austin Foundation Prelinger Archives National Science Foundation Library of Congress LizardTech Sloan Foundation Individual contributors   Skin: classic | columns | custom! Terms of Use (10 Mar 2001))
(20080430,www.archive.org,http://www.archive.org/details/DrinkingWithBob-MadonnaAdoptsAfricanBaby887,HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:48:30 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g X-Powered-By: PHP/5.0.5-2ubuntu1.4 Connection: close Content-Type: text/html; charset=utf-8 Internet Archive: Details: Madonna adopts African baby. Web | Moving Images | Texts | Audio | Software | Education | Patron Info | About IA Home Animation & Cartoons | Arts & Music | Computers & Technology | Cultural & Academic Films | Ephemeral Films | Movies | News & Public Affairs | Non-English Videos | Open Source Movies | Prelinger Archives | Sports Videos | Video Games | Vlogs | Youth Media Search: All Media Types   Wayback Machine   Moving Images     Animation & Cartoons     Arts & Music     Computers & Technology     Cultural & Academic Films     Ephemeral Films     Movies     News & Public Affairs     Non-English Videos     Open Source Movies     Prelinger Archives     Sports Videos     Video Games     Vlogs     Youth Media   Texts     American Libraries     Canadian Libraries     Open Source Books     Project Gutenberg     Biodiversity Heritage Library     Children's Library     Additional Collections   Audio     Audio Books & Poetry     Computers & Technology     Grateful Dead     Live Music Archive     Music & Arts     Netlabels     News & Public Affairs     Non-English Audio     Open Source Audio     Podcasts     Radio Programs     Spirituality & Religion   Software     CLASP     Tucows Software Library   Education Forums FAQs Advanced Search UploadAnonymous User (login or join us)  View movie View thumbnails Run time: 00:01:37 Play / Download (help) Quicktime (1.3 MB) All files: FTP HTTP Resources Bookmark Report errors Madonna adopts African baby. Internet Archive's in-browser video player requires JavaScript to be enabled. It appears your browser does not have it turned on. Please see your browser settings for this feature. 
embedding and help Madonna is an arrogant, publicity hungry, piece of trash!!! This item is part of the collection: blip.tv Write a review Reviews Downloaded 61 times Average Rating: Reviewer: _sprout - - April 27, 2008 Subject: Madonna is a washed up hag trying to keep her name in the papers +5 stars because I agree with your general statement that these 'exotic' kids are like exotic pets for rich people and celebs to show off. -2 stars because this sort of thing is better suited to Youtube. Reviewer: XXXmoan - - April 25, 2008 Subject: are You freakin Serious What the fuck! who cares if she goes to adopt an african, thats none of your business. you need to chill like that other bitch who claims that she doesn't care that madonna fell off a horse. so my question is.............................................................................what the fuck your problem Terms of Use (10 Mar 2001))
Elapsed Time: 117
Extract image urls, taking 3.
(http://www.archive.org/images/star.png,408)
(http://www.archive.org/images/no_star.png,122)
(http://www.archive.org/images/logo.jpg,118)
Elapsed Time: 1349
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.StringUtils._
import java.time.Instant
warcPath: String = ../src/test/resources/warc/example.warc.gz
timed: (f: => Unit)Unit
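
The `timed` helper in the script above takes a `=> Unit` block, so whatever the block computes is discarded and the elapsed time is only printed. A possible variant, offered as a sketch rather than as part of this PR, returns both the result and the elapsed milliseconds so they can be compared across runs. The `Timing` object and the `label` parameter are names invented here for illustration.

```scala
object Timing {
  // Evaluates `body` exactly once and returns its result together with the
  // elapsed wall-clock time in milliseconds; also prints a labelled line.
  def timed[A](label: String)(body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"$label: $elapsedMs ms")
    (result, elapsedMs)
  }
}
```

With this shape, a call like `val (top, ms) = Timing.timed("domains") { ... }` keeps the computed value in scope after the measurement. `System.nanoTime()` is used because it is intended for measuring elapsed time, whereas `System.currentTimeMillis()` reports wall-clock time that can be adjusted while the block runs.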

Contributor

greebie commented Apr 4, 2018

Same code, different warcs (a collection of files this time):

Get urls and count, taking 3.
(notstrategic.blogspot.ca,292)                                                  
(www.uvicfa.ca,73)
(uvfawhycertify.org,70)
Elapsed Time: 6291
Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.
((notstrategic.blogspot.ca,notstrategic.blogspot.ca),7234)                      
((uvic.ca,uvic.ca),5683)
((notstrategic.blogspot.ca,blogger.com),5584)
Elapsed Time: 8033
Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.
((20140303,notstrategic.blogspot.ca,notstrategic.blogspot.ca),7234)             
((20140221,uvic.ca,uvic.ca),5683)
((20140303,notstrategic.blogspot.ca,blogger.com),5584)
Elapsed Time: 7424
Extract text, taking 3 examples.
(20140305,www.facebook.com,https://www.facebook.com/ajax/pagelet/generic.php/ProfileTimelineSectionPagelet?ajaxpipe=1&ajaxpipe_token=AXjGHbVGAzxKWFnI&no_script_path=1&data=%7B%22profile_id%22%3A203757673123182%2C%22start%22%3A1357027200%2C%22end%22%3A1388563199%2C%22query_type%22%3A8%2C%22filter_after_timestamp%22%3A1378934453%2C%22section_pagelet_id%22%3A%22pagelet_timeline_year_last%22%2C%22load_immediately%22%3Afalse%2C%22force_no_friend_activity%22%3Afalse%7D&__user=0&__a=1&__dyn=7wKzS10Ax-7o8UhACGeGEmBWpU&__req=jsonp_2&__rev=1144733&__adt=2,HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Pragma: no-cache Cache-Control: private, no-cache, no-store, must-revalidate Expires: Sat, 01 Jan 2000 00:00:00 GMT P3P: CP="Facebook does not have a P3P policy. Learn why here: http://fb.me/p3p" Set-Cookie: datr=I20XU23yMpUji7cCx8Izh_4c; expires=Fri, 04-Mar-2016 18:29:55 GMT; path=/; domain=.facebook.com; httponly X-FB-Debug: YX4elgxPf3U7PARSuaw+Vmn26WwLVEipQCEdij+0JdU= Date: Wed, 05 Mar 2014 18:29:55 GMT Connection: close Content-Length: 13009)
(20140305,www.sfufa.ca,http://www.sfufa.ca/2676/the-unionization-question/,HTTP/1.1 200 OK Content-Type: text/html; charset=UTF-8 Server: Microsoft-IIS/7.5 X-Powered-By: PHP/5.4.0 X-Pingback: http://www.sfufa.ca/xmlrpc.php Link: ; rel=shortlink X-Powered-By: ASP.NET Date: Wed, 05 Mar 2014 18:22:41 GMT Connection: close Content-Length: 40473 The Unionization Question | SFU Faculty Association Home About Us SFUFA Constitution and Bylaws SFUFA Internal Policies SFUFA Grievance and Arbitration Agreement Representation Decisions and Appeals Policy Policy on Payment of Members’ Legal Expenses Partners CUFA/BC CAUT Education International Executive The Role of the Executive Current (2013-2014) Executive Committee Meeting Minutes Staff Committees and Advisors SFUFA Archives Previous Campaigns and Issues Term Limits for Chairs and Directors Campaign Against Cutbacks 2009 Teaching and Learning Task Force 2008 Ending Mandatory Retirement Previous Executive Committees Past Publications Previous Salary Agreements What We Do SFUFA Services Advocacy and Advising Collective Bargaining Consultation and Lobbying Transit Passes and Affinity Programs Translink Pass CAUT Affinity Programs Housing Listings Accomodations Available Accomodations Wanted Academic Discussion Email List Current Issues Collective Bargaining Bargaining Bulletins, 2014 Round (Current) Salary Arbitration 2013 Bargaining Bulletins, 2012 round (completed) Equity Unionization Resources News Accreditation Resources Learning Outcomes Resources SFUFA Executive Cmte Position on Learning Outcomes Governance Copyright Salaries & Benefits Salary Scales SFU Salaries Policy Benefits PDR Pensions Enhanced Early Retirement Other Retirement Options Agreements & Policies Introductory Information Framework Agreement SFU Policies Letters of Understanding and Other Agreements Ending of Mandatory Retirement Early Promotion to Senior Lecturer Letter of Agreement on Investigative Procedures SFUFA Commentaries on Key Issues 
Intellectual Property Academic Freedom Strikes and Job Action Academic Resources Renewal, Promotion and Tenure Sample Teaching Dossier Teaching and Learning Research Grants and Services Research Policies and Administration Campus Services Home › Uncategorized › The Unionization Question The Unionization Question Written May 02nd, 2013 May 2013 Introduction: At the SFUFA General Meeting in March 2013, members passed a motion directing the Faculty Association Executive to explore the potential for unionization for further discussion at the Fall General meeting in October. Under our constitution, motions from the floor are allowed, and any motion other than one amending the bylaws or otherwise restricted by the Societies Act is passed by a simple majority. The Faculty Association Executive is not taking a position on this issue, believing that such a decision rests solely with members. The following is provided to give members some background on the unionization question which we hope will help you become aware of the central issues and provide ample notice that this question will be on the agenda in October – we certainly do not want a vote on such an important issue to either pass or fail because members were not aware it was to be discussed. We intend to hold an event on the afternoon of May 28th including a panel discussion on unionization, with member-speakers on both sides of the question. We anticipate further events in the coming months to continue the discussion. At the Fall General Meeting, likely sometime in October of this year, we will provide a report back on what we have heard, and members may elect to either dismiss unionization, move forward toward unionization, or carry the discussion further. 
We cannot predict at this time what if any decisions might be reached at that meeting, but believe it is important that all members understand as early as possible that this issue is squarely on the agenda for discussion in the Fall, and all those with opinions on the matter are strongly encouraged to attend. The Growing Trend to Faculty Unionization: The vast majority of Canadian faculty associations are unionized in all provinces; only in Alberta is that not the case as that province explicitly disallows faculty unionization and provides instead a legislative framework for faculty representation. In British Columbia, faculty are unionized at all public post secondary institutions except SFU, Uvic and UNBC. 
Both the Uvic and UNBC associations, however, have indicated that they are actively considering unionization at this time, and the UVic Faculty Association in particular expects to undertake a formal union drive in the next academic year. The differences between unionized and non-unionized associations are minimal in certain areas of our work, significant in others. Unionization is not a decision that any association takes lightly and presents as many challenges as advantages; there is no question, however, that the trend in Canadian universities is overwhelmingly toward unionization, and it appears that trend is rapidly gaining currency among our partner organizations here in B.C. Our Current Situation:
 Under the terms of our Framework Agreement, SFUFA is recognized by the SFU administration as the sole collective bargaining agent for faculty members, and enjoys legal status as our members’ official representative body. This means that for purposes of salaries and economic benefits, and certain other policies that are mutually-recognized as ‘negotiated’, SFUFA and SFUFA alone may enter agreements with the university to amend salaries or conditions of employment for members. The university may not enter into agreement on those issues with anyone but SFUFA, or recognize any other employment representative of our members. Likewise, SFUFA has the right to collect dues, and all employees in our member groups are deemed SFUFA members unless the individual explicitly opts out (paying the dues instead to an eligible organization). Finally, as part of its right to bargain and collect member dues, SFUFA has a legal obligation to represent its members in good faith in all employment-related matters which fall under its jurisdiction. To do so, we have a number of means at our disposal from informal dispute resolution to final and binding arbitration, and members may hold us to account (both politically and legally) if we fail to adequately discharge our duty. What Unionization Means: In all of the above respects, SFUFA resembles a legally-recognized trade union. There are, however, differences, the following being the most significant. a) a trade union’s legal status is granted under the authority of the BC Labour Relations Code and disputes are resolved under the auspices of the Labour Relations Board, which has legal oversight with regard to ‘the interpretation, application or alleged violation of a collective agreement’. Under the current system here at SFU, our legal status is essentially a grant of the university, and disputes are to be ultimately resolved by the courts unless an alternative process has been agreed. 
(We do typically maintain an agreement with the university to rely on LRB-associated arbitrators rather than the courts, as they are more familiar with employment relationships than the courts might be.) b) our scope of bargaining is limited to those areas SFU has formally agreed to allow us to negotiate. That is, a trade union represents its members in all employment-related matters (unless restricted by statute), and has sole bargaining agent status on a more comprehensive basis than do we. While we may bargain policies that the university has conceded are negotiable, all other policies and procedures are solely discretionary. For example, matters of salary, most benefits, and many conditions of appointment (workload, tenure and promotion processes etc.) are indeed negotiated policies; significant matters, though, are not accepted as ‘negotiable’ – many policies related to consultation, leaves of absence, and use of non-SFUFA members to perform what is typically faculty work, for example, may not be subject to negotiations. In terms of both process and substance, then, there are differences between unionized and non-unionized associations. In general, unionization expands rather than shrinks the association’s reach and provides it a greater ability to advocate for members – in scope of bargaining, in the ability to represent members in more areas of potential dispute, and in legal remedies and avenues of appeal. The cost, of course, is the unionization process, which can be lengthy and potentially divisive, and the increased legal responsibilities the association incurs with unionization. Processes of Unionization – certification vs voluntary recognition: The most common path to unionization is what is called ‘certification’ – in this process, the association collects signatures or ‘cards’ from members indicating their desire to unionize. 
We would need to collect signed cards from 45% of eligible employees in order to ask the Labour Relations Board to hold a vote. That vote would then require that a majority of ballots were cast in favour of unionization. If the vote passed, we would be ‘certified’ as a union under the Labour Code, and would then serve notice to the university to begin bargaining a first true collective agreement. A less common route is ‘voluntary recognition’. Here, there is not necessarily any collection of cards, and unionization occurs because the employer willingly agrees to recognize the association as a union. There does not need to be a certification vote overseen by the LRB, but some process of ratification in which members clearly indicate their consent to unionization is required. The UBC Faculty Association is the closest example of voluntary recognition of a faculty association for our purposes. As a result of a union drive to organize sessional faculty, UBC agreed to recognize the FA as a whole as a union; the university managed to avoid a new union emerging on campus to represent sessional faculty, and the faculty and librarians became unionized. UBCFA and the university arranged to work slowly over a couple of years to transform the existing Framework Agreement and policies into a collective agreement, and agreed, too, to retain binding arbitration instead of using strikes or lockouts to resolve any breakdowns in bargaining (discussed in more detail below). Certification and voluntary recognition achieve the same thing – formalization of the association as a union under the BC Labour Code, with all the rights and responsibilities that designation provides. 
There is one substantive difference – a certified trade union can later approach the LRB to include new employee groups in its bargaining unit; a voluntarily recognized union generally cannot, as its union recognition arose out of explicit agreement with the employer to cover clearly designated groups and those groups alone. Both certification and voluntary recognition, then, are options we could consider. While at SFU there is currently no particular reason the university would need to voluntarily recognize SFUFA, it could be in their interest to do so, particularly if voluntary recognition might mean the maintenance of binding arbitration rather than the right to strike. Such an arrangement would limit our ability to take job action, but would achieve unionization and significantly strengthen SFUFA without a drawn-out certification campaign. It might also ease concerns members may have about strike action. Benefits and Risks of Unionization: Leaving aside ideological commitments for or against unions in general, unionization carries both benefits and risks to faculty members and to SFUFA as an organization. As in all decisions, individuals will likely decide to vote for or against unionization based on where their own cost-benefit assessment falls. The following are some of the key areas that people might consider: Association-Administrative Relations: Pros: While SFUFA has highly collegial relationships with the university administration at present, the association is most certainly a ‘junior partner’ – provided the opportunity to meet and talk and raise questions, but all-too-easily ignored in substantive terms. Unionization would strengthen the mandate of the association politically and achieve an improved balance of power between SFUFA and the administration. 
Cons: The union-employer relationship is different than that we currently have, and we could experience a certain amount of hardening of positions at least in the shorter term while both sides adjust to the new arrangement. The ability to get the ear of the president for a quiet word, to emphasize relations of goodwill over relations of legal status and rights – these have their limitations, but should certainly not be considered insignificant. Faculty Voice: Pro: Unionization would significantly strengthen the mandate of SFUFA and strengthen, too, its ability to legally defend faculty interests. It would provide a strong signal that members expect to be actively consulted, and allow the association to challenge unilateral decisions that adversely affect faculty but are not currently within our scope of bargaining. In an era in which the mechanisms of collegial governance seem to be weakening, unionization can be an effective method of protecting a democratic collective voice for members. Con: Any mechanism for a collective voice can create the illusion of consensus and alienate divergent opinions. Faculty and academic staff do not take unanimous positions on many issues, and there will no doubt be any number of issues on which an individual member’s views are not consistent with the position taken by the union. This is, of course, already the case, unionization or no. However, the strengthened mandate unionization provides may exacerbate these kinds of differences. Negotiation and bargaining: Pros: Unionization widens the scope of bargaining and lessens the unilateral power of the administration in a number of ways. It would bring together economic bargaining and discussions about policies and practices, allowing the association to bargain an overall agreement with greater consistency and ensuring that process and policy matters are included in each negotiating round. 
Cons: As unionization limits the unilateral power of the administration, it also can cause increased resistance to change and can heighten administrative concern about matters of precedent and long-term legal implications of agreements. That is, the flexibility management holds by virtue of policy can at times make it easier to achieve particular gains and can encourage new practices to be rolled out on a trial basis. Financial matters: Pros: As a general rule, unionized workers earn more than their non-unionized counterparts, and have greater equity in salaries and benefits as well. It is worth bearing in mind, however, that given both the peculiarities of the university sector and the role of the provincial government in constraining wages in the public sector, salary increases in BC in the foreseeable future cannot be presumed to follow the same pattern as they have historically. Cons: Unionized associations have greater responsibilities and represent members in a wider array of matters, with the result that costs to the association can increase. A dues increase is not a necessary outcome – i.e. UBCFA has a similar dues rate to ours, and is a unionized association – but costs of representation could rise, particularly in the shorter term while both parties adjust to a unionized environment. Flexibility: Pros: The expansion of the scope of bargaining would allow SFUFA to limit SFU’s arbitrary exercise of authority in a number of areas. Much of policy is now entirely controlled by the university, and even where consultation occurs, the incorporation of faculty feedback is often entirely discretionary. A more robust collective bargaining arrangement would certainly constrain the unilateral power of the university administration. Cons: Unionization does not mean that rules and regulations must necessarily interfere with the decentralization and collegial decision-making the university requires. 
However, agreements do bind faculty members as much as they bind the university, and departments and individuals would almost certainly experience some level of constraint in their own flexibility. The above are just a few of the trade-offs members and the association need to consider in deciding whether to proceed with unionization. The shift need not be life-changing, and in many universities much carries on as before. However, there would be changes – generally changes that would strengthen the association, but at the cost of a flexibility which benefits not only the administration but academic units and individual members as well. It is, then, for many not an easy choice to make; we can only encourage members to consider carefully all sides of the issue, to weigh the benefits and costs, and to make reasonable and informed decisions on how to proceed. Could We Lose Our Existing Protections? One concern of unionization is the possible loss of existing provisions that we enjoy. For example, the language on academic freedom in place at SFU is one of the strongest provisions in the country; might unionization require us to start from scratch and jeopardize such protections? The answer is not as simple as yes or no. The potential to lose existing provisions does of course exist anytime an agreement is opened for negotiation. But of particular concern at SFU is a provision in our Framework Agreement which states that the Agreement “shall lapse if and when the Faculty Association obtains certification under the provisions of the labour legislation of the Province of British Columbia”. This suggests that unionization might result in the immediate suspension of all existing protections. SFUFA has sought legal advice on this question specifically, and understands that the above language certainly does not mean all is lost, and may in fact be contrary to the law. 
The Labour Code includes provisions that come into effect immediately when an organization begins a certification process – specifically, the Code stipulates that no change in terms and conditions of employment can be imposed while the certification process is pending (section 32[1]) nor can any such changes be introduced following certification for either four months or the signing of a collective agreement, whichever comes first (section 45[1b]. What this means is that both during the process of certification and for four months afterwards, all current conditions would in fact be protected by law unless a collective agreement was signed in that period. Certification, then, does not mean the elimination of current protections, and the law would provide a period of four months for negotiations before it would be legally possible for any change to terms and conditions to be imposed. If, however, we failed to negotiate an agreement in that time, the university could be in a position to unilaterally alter its policies. STRIKE! The Big Question: Say unionization and people hear ‘strike’. There is no question that the strike and the union are firmly connected in public consciousness. However, while most unions have the right to strike, and while each year we hear about strikes across the country and across sectors, two things are worth noting very clearly: unionization does NOT necessarily mean adopting the strike as a tool, and many unions do in fact rely on binding arbitration as their dispute resolution mechanism; even where unions do have the right to strike, strikes are rare occurrences overall, and no strike ever begins without the express and explicit decision of the members themselves. The discussion about unionization for SFU faculty members is NOT a discussion about the right to strike. That is a secondary question that would be separately decided were SFUFA to unionize. 
Members would have to ratify a collective agreement that included provisions for dispute resolution, and could opt to take the right to strike or to maintain binding arbitration. At UBC, for example, the Faculty Association has explicitly chosen to retain arbitration and to prohibit strikes and lockouts, and there is no reason members here could not do the same. Strikes can be an effective tool, without question; and for many, the right to strike is seen as the most fundamental benefit of unionization. However, unions can and do rely on arbitration instead, achieving the benefits of unionization while avoiding the conflict and division that many people associate with strike action. Next Steps: SFUFA has been directed by its members to explore unionization and to report on the various issues the question of unionization raises. Over the next several months, we will be working to share information with you and identify opportunities for discussion and debate in anticipation of our report to the Fall General Meeting. This memo has set out just a few of the factors you might consider. We welcome any questions, and will endeavour to answer as fairly and objectively as we can, recognizing the diversity of opinion that exists already, and recognizing, too, that there are good reasons to both support or oppose this step. Ultimately the Executive Committee believes this is a question that can be answered only by the members of SFUFA themselves, and while individual executive members have individual opinions, the association as an organization will neither speak in favour of nor against the idea at this time, but will await further direction from the membership and act accordingly. Further information on unionization, and on events and opportunities to discuss the matter, will be provided as it becomes available. 
‹ Salary Arbitration Award SFUFA Spring Social, May 28 › The SFU Faculty Association is a member-driven professional association and collective bargaining agent for faculty, librarians and other academic staff at the three campuses of Simon Fraser University in Burnaby, Surrey and Vancouver B.C. Find us at: AQ 2035 at the Burnaby Campus Phone: 778-782-4676 Email: sfufa@sfu.ca Mailing address: 8888 University Drive Burnaby, BC V5A 1S6 © 2014 SFU Faculty Association | Log in ↑ Responsive Theme powered by WordPress)
(20140305,thetyee.ca,http://thetyee.ca/News/2012/11/07/Union-Raises-Manual/,HTTP/1.1 200 OK Date: Wed, 05 Mar 2014 18:23:03 GMT Server: Apache Vary: Accept-Encoding Connection: close Content-Type: text/html The Tyee – 'Confidential' Manual Admits It's Tough Finding Savings for Union Raises About Advertise Follow Support RSS Topic Aboriginal Affairs BC Election 2013 BC Politics Education Energy Environment Federal Politics Film Food Gender + Sexuality Health Housing Labour + Industry Local Economy Media Municipal Politics Music Photo Essays Podcasts Politics Rights + Justice Science + Tech Transportation Travel Urban Planning + Architecture The Hook BC Blog Directory Answer: Become a Builder. Question: More media diversity? 'Confidential' Manual Admits It's Tough Finding Savings for Union Raises BC gov't guide advises employers negotiating contracts. By Andrew MacLeod, 7 Nov 2012, TheTyee.ca   Tweet COPE 378 president David Black: 'We were suspicious about what "co-operative gains" meant in the first place.' Gov't Mum on How Union Raises Will Be Paid Unions say new efficiencies, lower sick pay and overtime will fill the gap. Funding Murky for Union Raises at UBC, UVic Budget's already squeezed before seeking 'efficiencies' says UBC spokesperson. BC to withhold BCNU contract details until other unions reach deals Labour + Industry Find more labour reporting on The Tyee. Read more: Politics, Labour + Industry, Many parts of the public sector will have a hard time finding the budget cuts needed to pay for wage increases, acknowledges a "strictly confidential" guide for employers in the current round of bargaining in British Columbia. "Many employers will not be able to find savings needed to generate funding for a modest wage increase," says the provincial government guide The Tyee obtained. "This may be an opportunity to work with unions to find other, non-monetary improvements to collective agreements." 
The statement is included in the "Frequently Asked Questions" section of the Employers' Guide to 2012 Co-operative Gains Mandate, a 32-page document marked "Prepared for Collective Bargaining -- Strictly Confidential." Even for employers who can find the money in their budgets, the government will not allow them to pass as much as they might like on in wage increases to their employees. "There is no set wage increase determined for the mandate, but we will limit maximum average increases in the deals to reduce the variability of potential outcomes," it said. "Some employers will have difficulties finding savings to fund increases and we want all groups to be treated fairly ... Employers are facing a difficult task to find any savings at all." So far several unions have reached tentative agreements under the mandate, some of which members have ratified, but none so far has received a wage increase greater than four per cent over two years. Some 370,000 British Columbians work in the public sector, with compensation costing nearly $24 billion a year. Savings plans required The government's guide explains to public sector employers -- which include things like school boards, universities, Crown corporations and health authorities as well as direct government employees -- what the province expects under the co-operative gains mandate round of bargaining. Employers were required to submit savings plans and bargaining plans to the government for approval before beginning negotiations on anything substantive. "The Province will not provide additional funding for increases to compensation negotiated in collective bargaining," the guide said. "Employers are directed to work with responsible ministries and employer bargaining agents to develop savings plans to free up funding from within existing budgets to provide modest compensation increases." 
Savings could come from operational cost reductions, increased efficiency, service redesign, increases in revenue, and other initiatives, it said. "Identified savings must be real, measurable, and incremental to savings identified by employers to meet Provincial Budget and deficit reduction targets for 2012/13 and beyond." It added, "Savings proposals must be supported by the best possible evidence available." The costing of the savings would have to use "realistic and conservative assumptions," it said. There also had to be a way to measure whether they had actually been achieved. The money could not, however, be found by reducing service to the public or by transferring the cost of existing services to the public, it said, which is also what the government has said publicly about the mandate. "The 2012 Co-operative Gains Mandate will be highly sector and employer dependent," the guide said. Factors in the outcome will include the "ability to generate savings and the willingness of unions to co-operate in bargaining." Government ministries would be closely involved in developing savings plans, it said. "Ministries will be responsible for ensuring the accuracy, commitment and ability to track savings in savings plans and that bargaining strategies align with the province policy goals." Bargaining strategy secret Government ministries have for the most part declined to share the details of the savings plans it has approved under the co-operative gains mandate. "We are still negotiating with other unions where savings may be directed and so sharing the savings plan details would be the same as disclosing our bargaining strategy to those unions," said Health Ministry spokesperson Ryan Jabs in an email about the agreement medical residents ratified last week. "Just like in any negotiation these savings plans have to remain confidential to protect the bargaining position of the employer," he said. 
"In addition, the savings plans connect to government's overall bargaining strategy. Revealing them may affect government's overarching strategy for this round of bargaining and jeopardize the success we are starting to see in reaching agreements under this mandate." Residents are employees of the province's six health authorities. The four-year agreement includes two years with no increases, followed by a 1.5 per cent raise in January 2013 and a 1.3 per cent increase in April of 2013. "High level" examples of where health authorities might save money included things like bulk buying, reducing the cost to the employer of benefit plans and reducing administrative expenses, said Jabs. "By making changes to the benefit plans the savings can be put in employees' pockets as wage increases," he said. "Employees can then choose how they use their wage to support themselves and their families." The total budget for the health authorities is over $13 billion, he said, so "making small changes can result in significant savings to contribute to the agreement." Productivity and efficiency The response from the Advanced Education Ministry's spokesperson was similar when asked about the savings plan Royal Roads University had used to fund wage increases of two per cent in each of two years for CUPE 3886 members. "Negotiations across the public sector are underway and the savings plans are part of each employer's bargaining mandate and strategy," he said on background. "These savings plans have to remain confidential to protect the bargaining position of the employer." Savings plans may span several tables in a sector and are connected to the government's overall bargaining strategy, he said. "Revealing them may affect government's overarching strategy for this round of bargaining and jeopardize the success we are starting to see in reaching agreements under this mandate." The Tyee has previously reported on the savings plans related to a few tentative agreements. 
A University of Victoria spokesperson said the school is paying for raises by trimming $10 million out of departmental budgets over the next two years. BC Nurses' Union president Deb McPherson said increases to her union were to be paid for through the productivity gains and reduced overtime from increasing the work week to 37.5 hours from 36 hours. The BCGEU's chief negotiator David Vipond has said the government's savings plan to pay for raises to union members included using an Internet tool expected to improve employee health and implementing a form of the Japanese efficiency system "Kaizen" known as "Lean." The government's press release on the ratification of the BCGEU agreement provided more detail on the latter. "Lean is a process improvement method that focuses on saving time by removing unnecessary steps in processes and on delivering value to clients," it said. "Although the focus is not solely on cost, Lean is a proven technique and by saving time in a process, quality improvements and cost savings are often a result." Across government there are 47 Lean projects underway, most in their initial stages, it said. For example, the Public Service Agency is working to reduce the time it takes to process long-term disability applications by 20 per cent, it said. "Early results are promising." The raises to BCGEU members should cost less than $60 million a year before any savings are taken into account, a Finance Ministry spokesperson said. ICBC union 'suspicious' of mandate A tentative agreement announced Nov. 2 gives some 4,600 employees at the Insurance Corporation of BC (ICBC) four raises of one per cent each spread over two years. "It's a relief to finally have something we can recommend to our members," said David Black, the president of COPE 378. The co-operative gains mandate was a minor factor at the negotiating table as far as the union was concerned, he said. "We were suspicious about what 'co-operative gains' meant in the first place," he said. 
By the time his union started negotiating, they'd already seen the government quash the BCGEU's "very creative suggestions" for things like opening liquor stores on Sunday, he said. In its negotiations with ICBC, COPE 378 suggested ways to raise revenue including selling auto insurance outside of B.C. and offering other types of insurance, such as home insurance, to people inside the province, said Black. "The government had no appetite for that." The lack of interest from the government took the issue of co-operative gains off the table, he said. "We just decided we weren't going to play that game." A spokesperson for ICBC, Adam Grossman, said he couldn't provide any further details until the agreement is ratified. Many negotiations remain underway, and in at least some cases employers are asking unions to reduce their members' benefits to generate the savings for any wage increases, something they are loathe to do after two years without a raise under the government's previous bargaining mandate. Read more: Politics, Labour + Industry, Andrew MacLeod is The Tyee's Legislative Bureau Chief in Victoria. Find him on Twitter or reach him here.   Tweet What have we missed? What do you think? We want to know. Comment below. Keep in mind: Do: Verify facts, debunk rumours Add context and background Spot typos and logical fallacies Highlight reporting blind spots Treat all with respect and curiosity Connect with each other Do not: Use sexist, classist, racist or homophobic language Libel or defame Bully or troll Please enable JavaScript to view the comments powered by Disqus. 
comments powered by Disqus Nearly half of Canadians taking part in 'collaborative economy': report BC health ministry reinstates one worker it fired in 2012 New Furlong filing admits 'no actual knowledge' behind a key allegation Anti-bullying site quiet on homophobia: advocate Tsilhqot'in celebrate as feds again block New Prosperity mine VIEW: Choice of BC budget's example families is, well, a little rich Court overrules DFO decision to reopen herring fisheries Court grants government stay on teacher ruling See more » Tyee Video Picks Submitted by Crawford Kilian, 3 Mar 2014 An amazing look at Canada by helicopter in 1966 Where do these videos come from? From you. Suggest one here! Reported Elsewhere Tories kill bid to investigate Brad Butt voter fraud claim (via CBC News) Mayor Rob Ford lambasted during appearance on 'Jimmy Kimmel' (via Global News) Dead soldier's family 'devastated' after getting one-cent cheque from Ottawa (via The Globe & Mail) Canada's obesity rates triple in less than 30 years (via The Huffington Post) Why doctors are the biggest barrier to getting medical marijuana in Canada (via Leaf Science) Canada Post unveils new community mailbox design (via the Toronto Star) Ruins of the future: re-purposed Pizza Huts (via 99% Invisible) The subculture of Japanese trucker art (via Messy Nessy) See more » Your BC: The Tyee's Photo Pool Join The Tyee's Flickr group » View previous selections » About The Tyee Advertise Contact Funding Jobs Master Classes Privacy Policy Submissions Support The Tyee News Arts & Culture Life Opinion Mediacheck Books Video Publications Series Tyee News Weekly Archives The Hook Other BC Blogs Reported Elsewhere Steve Burgess Murray Dobbin Michael Geist Crawford Kilian Rafe Mair Andrew Nikiforuk Shannon Rupp Bill Tieleman Dorothy Woodend Follow The Tyee Subscribe by email Subscribe by RSS Tyee Mobile App Managed Hosting by Gossamer Threads × Share article via email Send this article to: Your email address: Your message: Would 
you like to receive The Tyee's headlines free by email? Yes, every day. Yes, once a week. I already subscribe. No, thanks. Submit Show form again » Close)
Elapsed Time: 273
Extract image urls, taking 3.
(http://img1.blogblog.com/img/icon18_wrench_allbkg.png,1420)                    
(http://img2.blogblog.com/img/icon18_edit_allbkg.gif,854)
(http://1.bp.blogspot.com/-xQH3SvI8It0/URLji9w5H1I/AAAAAAAAAMY/ArS17Xjc3As/s190/No%2Bcherry-picking%2Bdata.png,355)
Elapsed Time: 6241
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.StringUtils._
import java.time.Instant
warcPath: String = /Users/ryandeschamps/tmpAut/aut/warcs/*.warc.gz
timed: (f: => Unit)Unit

@greebie
Contributor

greebie commented Apr 4, 2018

Okay, after contending with some Java heap space errors, I got a good-sized set of WARCs to run fine. From my perspective, this is ready to merge.

Get urls and count, taking 3.
(www.bcliberals.com,91602)                                                      
(www.bctf.ca,15633)
(www.staffroomconfidential.com,2776)
Elapsed Time: 366342
Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.
((bcliberals.com,bcliberals.com),2412524)                                       
((staffroomconfidential.com,staffroomconfidential.com),284085)
((bcliberals.com,twitter.com),148612)
Elapsed Time: 421201
Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.
((20141004,bcliberals.com,bcliberals.com),1973499)                              
((20141003,bcliberals.com,bcliberals.com),439025)
((20140925,staffroomconfidential.com,staffroomconfidential.com),257457)
Elapsed Time: 406408
Extract text, taking 3 examples.
(20150813,www.youtube.com,https://www.youtube.com/embed/c1em8yVbR4k?rel=0,HTTP/1.0 200 OK Date: Thu, 13 Aug 2015 17:56:42 GMT Server: gwiseguy/2.0 X-XSS-Protection: 1; mode=block; report=https://www.google.com/appserve/security-bugs/log/youtube P3P: CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=en for more info." Content-Type: text/html; charset=utf-8 Cache-Control: no-cache Expires: Tue, 27 Apr 1971 19:44:06 EST X-Content-Type-Options: nosniff Set-Cookie: VISITOR_INFO1_LIVE=RQVfbN0TJoo; path=/; domain=.youtube.com; expires=Wed, 13-Apr-2016 05:49:42 GMT; httponly Set-Cookie: VISITOR_INFO1_LIVE=RQVfbN0TJoo; path=/; domain=.youtube.com; expires=Wed, 13-Apr-2016 05:49:42 GMT; httponly Set-Cookie: YSC=Rd5JEU0EeCo; path=/; domain=.youtube.com; httponly Alternate-Protocol: 443:quic,p=1 Alt-Svc: quic=":443"; p="1"; ma=604800 Accept-Ranges: none Vary: Accept-Encoding K-12 Innovation Forum: Comox Valley - YouTube An error occurred. Try watching this video on www.youtube.com, or enable JavaScript if it is disabled in your browser.)
(20150813,www.youtube.com,https://www.youtube.com/embed/k0t0rvjQ8zQ?rel=0,HTTP/1.0 200 OK Date: Thu, 13 Aug 2015 17:56:43 GMT Server: gwiseguy/2.0 X-Content-Type-Options: nosniff Expires: Tue, 27 Apr 1971 19:44:06 EST Content-Type: text/html; charset=utf-8 Cache-Control: no-cache X-XSS-Protection: 1; mode=block; report=https://www.google.com/appserve/security-bugs/log/youtube Alternate-Protocol: 443:quic,p=1 Alt-Svc: quic=":443"; p="1"; ma=604800 Accept-Ranges: none Vary: Accept-Encoding K-12 Innovation Highlight: Comox Valley - YouTube An error occurred. Try watching this video on www.youtube.com, or enable JavaScript if it is disabled in your browser.)
(20150813,www.youtube.com,https://www.youtube.com/embed/CAPvUYS_QKc?rel=0,HTTP/1.0 200 OK Date: Thu, 13 Aug 2015 17:56:44 GMT Server: gwiseguy/2.0 Expires: Tue, 27 Apr 1971 19:44:06 EST Cache-Control: no-cache X-Content-Type-Options: nosniff Content-Type: text/html; charset=utf-8 X-XSS-Protection: 1; mode=block; report=https://www.google.com/appserve/security-bugs/log/youtube Alternate-Protocol: 443:quic,p=1 Alt-Svc: quic=":443"; p="1"; ma=604800 Accept-Ranges: none Vary: Accept-Encoding K-12 Innovation Highlights: Central Okanagan - YouTube An error occurred. Try watching this video on www.youtube.com, or enable JavaScript if it is disabled in your browser.)
Elapsed Time: 113
Extract image urls, taking 3.
(http://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif,45715)                          
(http://img2.blogblog.com/img/b36-rounded.png,37990)
(http://img1.blogblog.com/img/icon18_wrench_allbkg.png,36063)
Elapsed Time: 435895
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.util.StringUtils._
import java.time.Instant
warcPath: String = /Users/ryandeschamps/tmpAut/aut/warcspt2/*.warc.gz
timed: (f: => Unit)Unit

@ianmilligan1
Member

ianmilligan1 commented Apr 4, 2018

As I work on the docs - when does import io.archivesunleashed.util.StringUtils._ need to be called?

Never mind. Is it just needed wherever it was called previously?

@lintool
Member Author

lintool commented Apr 4, 2018

StringUtils gives you ExtractDomain(r._1).removePrefixWWW().

Should I fold it so it gets auto-imported in import io.archivesunleashed.matchbox._?

This is a tradeoff between convenience/magic and "knowing exactly what you're doing". If there's too much magic, it becomes difficult to understand certain behaviors...

@ianmilligan1
Member

My vote would be to fold it in so it gets auto-imported in import io.archivesunleashed.matchbox._ as it's fairly core? Then we could pretty consistently just have the standard imports i.e.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

and not have to worry about explaining when to use x and/or y, etc.

@ianmilligan1
Member

I have this branch of AUT hooked up to AUK, btw, so will run some tests overnight. So far on smaller collections all looking good.

@greebie
Contributor

greebie commented Apr 4, 2018

I think StringUtils is a good utility for most aut jobs but JsonUtils would be better to keep out unless someone specifically wants Json-y output. Same for ImageUtils if that idea becomes part of the refactoring (an alternative might be to do Image-related work [not image urls] as an aut plugin).

@lintool
Member Author

lintool commented Apr 4, 2018

StringUtils moved to matchbox package-level implicit for auto-import, as requested.
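
For readers following along, here is a minimal sketch of what a package-level implicit could look like. It is not AUT's actual source: `matchboxSketch`, `StringUtilsSketch`, and the `ExtractDomain` stand-in are hypothetical names used only to illustrate the pattern, and the regex inside `removePrefixWWW` is an assumption.

```scala
// Hypothetical sketch: names mirror the thread but this is not AUT's real code.
object matchboxSketch {
  // Stand-in for the ExtractDomain UDF; the real implementation may differ.
  object ExtractDomain {
    def apply(url: String): String =
      try {
        Option(new java.net.URI(url).getHost).getOrElse("")
      } catch {
        case _: java.net.URISyntaxException => ""
      }
  }

  // Implicit class defined alongside the UDFs, so one wildcard import
  // brings both the UDFs and the String extension methods into scope.
  implicit class StringUtilsSketch(s: String) {
    def removePrefixWWW(): String =
      if (s == null) null else s.replaceAll("^\\s*www\\.", "")
  }
}

object ImplicitImportDemo {
  import matchboxSketch._ // single import: UDFs plus implicit String methods

  def domainOf(url: String): String = ExtractDomain(url).removePrefixWWW()

  def main(args: Array[String]): Unit =
    println(domainOf("http://www.bctf.ca/index.html"))
}
```

The convenience is that callers need only the one wildcard import. The cost, as noted above, is that `removePrefixWWW` appears on every `String` without an obvious origin.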

Member

@ruebot ruebot left a comment

Code looks fine to me. Just a couple of doc comment and license header issues to sort out.

Once @ianmilligan1 and @greebie are satisfied with their testing, and I get some time to do a smoke test, I'll merge this.

After that, I'll update https://github.com/archivesunleashed/aut/tree/issue-184 and @helgeho will need to update #186.

Sound good?

import scala.xml.Utility._

/**
* Created by jimmylin on 4/4/18.
Member

Can we get an actual description here? "Created by" isn't needed since it'll show up in the commit history as authored by you.

@@ -1,34 +1,31 @@
/*
Member

Can you leave the license headers in?


import scala.reflect.ClassTag
import scala.util.matching.Regex

/**
* RDD wrappers for working with Records
Member

Doc comment?

@lintool
Member Author

lintool commented Apr 5, 2018

Addressed comments.

@lintool
Member Author

lintool commented Apr 5, 2018

Hrm... something is very screwy. All I did was add documentation. How can it possibly have affected test coverage?

@ruebot
Member

ruebot commented Apr 5, 2018

It's total lines. I haven't fine-tuned the codecov configuration. If it's around 0.5% I usually ignore it.
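
To make the "total lines" point concrete, here is a small arithmetic sketch using the hit and line counts from the Codecov report posted on this PR (431 of 639 lines on master, 427 of 635 here). Those figures may not correspond to the exact run being discussed, and `coveragePct` is a hypothetical helper, not part of Codecov.

```scala
object CoverageSketch {
  // Line coverage as a percentage: covered lines divided by tracked lines.
  def coveragePct(hits: Int, lines: Int): Double = 100.0 * hits / lines

  def main(args: Array[String]): Unit = {
    val before = coveragePct(431, 639) // master, per the Codecov report
    val after = coveragePct(427, 635)  // this PR, per the Codecov report
    println(f"before=$before%.2f%% after=$after%.2f%% delta=${after - before}%.2f")
  }
}
```

Because the percentage is a ratio, dropping four covered lines while the uncovered count stays the same lowers it slightly, even though no test changed.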

@ianmilligan1
Member

Looking good! I'm running some trials overnight using AUK, and should hopefully be able to give my thumbs up tomorrow morning. 👍

@ianmilligan1
Member

ianmilligan1 commented Apr 5, 2018

Some failures on WriteGraphML with this new version while doing AUK testing.

<console>:33: error: not found: value WriteGraphML
             WriteGraphML(links, "/users/ianmilligan1/desktop/auk-download/990/8427/1/derivatives/gephi/8427-gephi.graphml")

Have you tested that function @greebie?

This is while running some variation of

      import io.archivesunleashed._
      import io.archivesunleashed.matchbox._
      sc.setLogLevel("INFO")
      RecordLoader.loadArchives("#{collection_warcs}", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("#{collection_derivatives}/all-domains/output")
      RecordLoader.loadArchives("#{collection_warcs}", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("#{collection_derivatives}/all-text/output")
      val links = RecordLoader.loadArchives("#{collection_warcs}", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\\\s*www\\\\.", ""), ExtractDomain(f._2).replaceAll("^\\\\s*www\\\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
      WriteGraphML(links, "#{collection_derivatives}/gephi/#{c.collection_id}-gephi.graphml")
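
A note on the `links` step above: the `#{...}` placeholders suggest the script is generated from a template, so the quadrupled backslashes are presumably an extra layer of escaping. Assuming that, the pattern in plain Scala source would be `"^\\s*www\\."`. The sketch below reproduces the shape of the pipeline on ordinary Scala collections rather than RDDs. `LinkCountSketch`, `stripWww`, and `countLinks` are hypothetical names, the `groupBy` counting only approximates `countItems()`, and the sample data is made up.

```scala
object LinkCountSketch {
  // Regex as it would appear in Scala source once the template's extra
  // escaping is removed: optional leading whitespace, then "www.".
  val wwwPrefix = "^\\s*www\\."

  def stripWww(domain: String): String = domain.replaceAll(wwwPrefix, "")

  // Plain-collection stand-in for the RDD steps: normalise both domains,
  // drop pairs with an empty side, count identical triples, and keep
  // only those whose count is strictly greater than minCount.
  def countLinks(
      links: Seq[(String, String, String)],
      minCount: Int
  ): Seq[((String, String, String), Int)] =
    links
      .map { case (date, src, dst) => (date, stripWww(src), stripWww(dst)) }
      .filter { case (_, src, dst) => src != "" && dst != "" }
      .groupBy(identity)
      .map { case (triple, occurrences) => (triple, occurrences.size) }
      .toSeq
      .filter { case (_, count) => count > minCount }
      .sortBy { case (_, count) => -count }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: (crawl date, source domain, target domain).
    val sample =
      Seq.fill(3)(("20141004", "www.bcliberals.com", "twitter.com")) ++
        Seq(("20141004", "www.bctf.ca", ""), ("20141003", "bctf.ca", "www.bctf.ca"))
    countLinks(sample, 2).foreach(println)
  }
}
```

With `removePrefixWWW()` available through the matchbox import, the two `replaceAll` calls in the real script could probably be replaced by it, though that has not been checked here.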

@lintool
Member Author

lintool commented Apr 5, 2018

WriteGraphML got moved to io.archivesunleashed.app so adding

import io.archivesunleashed.app._

should work.

@ianmilligan1
Member

Oh, I see - ok, let me try again. 👍

Member

@ianmilligan1 ianmilligan1 left a comment

I've been running this continuously using AUK (and some spot checks with spark shell) and it is all working now. It's good to merge as far as I'm concerned.

@greebie greebie self-requested a review April 5, 2018 17:46
@greebie
Contributor

greebie commented Apr 5, 2018

Just thought I'd add a check to confirm I'm happy with the current PR.

@ruebot
Member

ruebot commented Apr 6, 2018

Smoke test is good. Merging now.

@ruebot ruebot merged commit c5f07c6 into master Apr 6, 2018
@ruebot ruebot deleted the refactoring branch April 6, 2018 13:59
ruebot added a commit that referenced this pull request Apr 6, 2018
4 participants