email component ---
proxy!proxy!........................................................done!
load page from cloud drive! +++ ...............................................done!
integrate occam's machete's boilerpipe!.......................................................done!
xpath becomes necessary +++ ...........................................................Long!
restful api support: namely wpost, wput, wupdate & wdelete +++ .......................................Long!
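A hedged sketch of what this action family could look like (the case-class shapes below are assumptions for illustration, not the shipped API):

```scala
// Hypothetical REST action variants mirroring wget; headers/auth elided.
sealed trait RestAction { def uri: String }
case class WPost(uri: String, body: String) extends RestAction   // HTTP POST
case class WPut(uri: String, body: String) extends RestAction    // HTTP PUT
case class WDelete(uri: String) extends RestAction               // HTTP DELETE
```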
support proxy in wget & restful --.....................................................done!
spike/read documentation on Tika, or UIMA?...................UIMA is too heavyweight, hell no.................
ISpark - set up blocking interpreter to avoid race conditions on startup ---
implement slice(expand) ---
handle cookies! --- very likely doesn't work in a multi-browser situation! remember Spring Security?
try StringContext ---....................................it's for simplified reflective variable interpolation, not useful in our case.
Chinese tokenization.................................................Spike on Friday! Kyde: do you want to pair on this?.....................outsourced
throw exception on ClientAction that compresses information and stores the path of the error page...................done!
move pageLoadDeadline from const to context so it can be extended when using proxies...................................done!
filter proxies, find the useful ones............................................................................on hold!
lazy execution! wait until the point of result resolution!.........................................................done!
lazy execution model ----> export will render all previous interactions.......................................................done!
documentation! documentation! ewwwwwwww ++
merge terry and sandeep........................................................................................done
now extract will preserve the order of columns as they were imported!...................................done
add back the old examples +++.....................................................................................done
bug1: cannot restore from cache, throws ClassNotFoundException...........................................................................done
bug2: cannot use dot in column names, cannot be used as a symbol...................................................won't fix until https://issues.apache.org/jira/browse/SPARK-2775 is resolved
bug3: cannot extract from components of PageRow other than Page: incomplete solution, will be completely solved after the select expression is fully implemented......................done
add timeout for pages.......................................................................done
dual licensing!!!.......................................................................done
deprecate sliceJoin! delegate to flatten/explode +++ .............................................................done
element interpolation! Thanks Liu!.........................................................................................done
collect feedback! ++++ pending for too long................................................................................done
TDD -- depends on previous ---
check if pageExpireAfter is functioning +++......................................................................................done, sporadic deserialization failure, no big deal
ergodic pagination ...........................................................................................done
reference to query lambda, ask Liu for suggestion................................................................................done
support proxies in wget................................................................done, test finished
upgrade ansible PhantomJS config to 1.9.8! ..................................................................done
finish BioCompare.......................................................done, 6944245 rows
support persisting cookies in visit +
non-constructor fields (metadata) of actions are lost upon interpolation, fix it!.........................done
merge reusable block & Seq[Action] operations into ActionLike trait ......................done
PageUID contains leaf that denotes the named action that generates the page................done
aggregate some Action & Seq[Action] functions into ActionLike & Trace.......................done
refactor all action chaining & flow control to Trace, this is to simplify query and enable query lambda..........................................................................................done
refactor join by Unix philosophy ..........................................done
DAAAAAAATAAAAAAAABRIIIIIIIIIIICKS certificates! +++....................................................................................DOOOOOONNNNNEEEEE!
Fill out a general questionnaire about the product (e.g., integration dependencies, product information, versions of Spark supported).......................done
Provide log files demonstrating successful build / operation of application on top of Spark...................................................done
Is Spark listed as a supported platform in the product documentation? If so, can you please provide a copy of the relevant section?.....................done
Are instructions provided as part of the product documentation for installing / configuring the product to work with Spark? If so can you please provide a copy of the relevant section?.........done
Integration Test Output: the product must provide test output that demonstrates successful operation according to the parameters described above: provide a test plan description for the integration test; describe the particular environment the test was executed in; provide log files from the output of the test along with a brief description.....................................................................done
need complete documentation and certificates +++........................................................................done
browser pool to avoid repeated launches and deadlocks -......0.4
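A minimal sketch of the pooling idea, assuming a generic driver type (the real driver wrapper and its lifecycle hooks are not modeled here):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Reuse idle drivers before launching new ones; callers must release
// drivers back to the pool to avoid the repeated-launch cost.
class DriverPool[D](create: () => D) {
  private val idle = new ConcurrentLinkedQueue[D]()
  def borrow(): D = Option(idle.poll()).getOrElse(create())
  def release(driver: D): Unit = idle.add(driver)
}
```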
fix the bug where restore doesn't rename Page's name to new ones, and test it!...............done
move wget headers and content-type to SpookyContext..........................................done
integrate simple accumulator for realtime dashboard.........................................done
move all integration test to spookystuff-example.............................................done
all actions should take Expr parameters....................................................................done
investigate vmf and find their security measures...............................................done
deprecate and clean up paginate, slice & extract after their substitutes are finished ..............................done
set functions of pages to two-stage: page -> elements -> attributes/text (sketched after the next item).......................done
...also revise Expr[_] to maximize parsimony in join....................................done
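A self-contained sketch of the two-stage chain; these tiny classes only illustrate the shape, not the real Page/Elements API:

```scala
case class Element(text: String, attributes: Map[String, String])

// Stage 2: elements -> attributes/text.
case class Elements(seq: Seq[Element]) {
  def texts: Seq[String] = seq.map(_.text)
  def attrs(name: String): Seq[String] = seq.flatMap(_.attributes.get(name))
}

// Stage 1: page -> elements (selector matching elided).
case class Page(all: Seq[Element]) {
  def findAll(selector: String): Elements = Elements(all)
}

// Usage: page.findAll("a.product").attrs("href")
```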
fix a bug where a missing key in all elements of the RDD will cause SchemaRDD.select to try to select a non-existent column +++
change fetch to use distinct -> join instead of groupByKey ..................................................useless, won't preserve order anyway
preserve order by SQL automatically!.....................................................done
determine numPartitions automatically!....0.4
officially make PageSchemaRDD an RDD + ...........................................................................done
enable alias ~ for last inserted key ---..........................................useless, replaced by default key
reduce remote resource waiting time at the cost of more driver initialization waiting time
break sc and sql out of SpookyContext as they are determined by the RDD......0.4
fix repl........even if it's useless ...............................................................................done
accumulators showing that lazy load may still initialize driver. Investigate that!.........................................not a bug: accumulator incremented at the wrong position
pageIndexKey should be added into keySet..................................................done
GraphX integration: supports export as Graph.....0.4
fix example....................................................................................done
...and documentation.
split spooky intp from ISpark into an independent project...........................done
parsehub/kimono-ish API (probably won't be optimal, size is too big)....................................done
explore is still defective: need to remove added keys that cause duplicates.................done
fetch needs to read the cache for itself by doing a dry run on the trace and joining with all backtraces, so explore won't have high load.............................................................need test
point demo uri to notebook............................................................................................................done
abandon lazy execution in Session and use dryrun to try to load from cache; this avoids the latest snapshot being discarded because an earlier version was found in the cache..........done
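A minimal sketch of the dry-run idea (all names illustrative): the whole trace is checked against the cache up front instead of lazily per action, so a stale partial hit can no longer shadow a fresher full result:

```scala
// Returns cached pages only when the entire trace hits the cache;
// otherwise executes the trace in a fresh session.
def fetch[P](trace: Seq[String],
             cacheAll: Seq[String] => Option[Seq[P]],
             run: Seq[String] => Seq[P]): Seq[P] =
  cacheAll(trace).getOrElse(run(trace))
```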
checkpointing doesn't seem to work properly, investigate that!...............................................................done
additional 'narrow' engine for explore.................................................................done
switch PageRowRDD to a pair RDD to carry metadata and allow a more efficient implementation
switch to key-value in the lookup table to leverage the partitioner......................................................................done
add condition parameter to loop.........................................................................................partially done; Assert can handle it
add feature to combine 2 cells into a map...................................................................................done
convert old queries and run them
fix macys
add HtmlUnit driver & Chrome server driver, refer to Selenide if necessary........................................done
add support for Sizzle CSS selectors to avoid breakdown, refer to Selenide if necessary.............................done
query optimization becomes an argument in fetch, join and explore. It only impacts efficiency/redundancy and has no impact on the result........................................done
bug: empty pageRow.explore will still have depthkey inserted!.................................................................not a bug
lookup & Backtrace: the problem is that results from 2 executions of a trace may look consistent on UID but have incoherent content, and lookup is too complex at the moment. Solution: put session creation time into the UID: this makes it a real "UID", as pages from 2 sessions won't share it. To look up the entire output sequence of a given Trace, one just needs an extra condition checking that all pages' session creation times are identical - or aggregate by session creation time, find the latest, and do an integrity/sanity check.
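A minimal sketch of that scheme, with hypothetical simplified types (not the real spookystuff classes):

```scala
// Session creation time is baked into the UID, so pages from two
// executions of the same trace can never be mistaken for one output.
case class PageUID(traceHash: Int, blockIndex: Int, sessionStartTime: Long)
case class PageLike(uid: PageUID, content: String)

// Aggregate candidates by session, take the latest, and sanity-check
// that every block of the trace is present.
def lookupCoherent(candidates: Seq[PageLike], expectedBlocks: Int): Option[Seq[PageLike]] =
  candidates
    .groupBy(_.uid.sessionStartTime)
    .toSeq
    .sortBy(-_._1)                          // latest session first
    .map(_._2.sortBy(_.uid.blockIndex))
    .find(_.size == expectedBlocks)         // integrity/sanity check
```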
how to avoid long delay in loops? is it possible?
add operator into/~+ that does + if no field exists in the cell, and appends the result at the end of the cell if the cell is a collection of a valid type!.........................done
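A minimal sketch of the intended semantics, modeling a row as a plain Map (the real cell types are assumptions here):

```scala
// Behaves like + when the field is absent; appends when the existing
// cell is already a collection; otherwise promotes the cell to a Seq
// (this last fallback is a guess, not specified above).
def into(row: Map[String, Any], key: String, value: Any): Map[String, Any] =
  row.get(key) match {
    case None              => row + (key -> value)
    case Some(seq: Seq[_]) => row + (key -> (seq :+ value))
    case Some(old)         => row + (key -> Seq(old, value))
  }
```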
try different lookup table configurations and optimize to the core: should it be [Trace, Seq[PageLike]]?
metrics/accumulator is still defective, fix it! .......................................................................done
metrics.toJson to allow quick review of stats...............................................................................done
tune smartExplore to avoid imbalanced partitioning from 1 seed page...........................................not a bug?
stage dumbExplore periodically to avoid memory overflow.........................................................................done
fix the Apache org.apache.http.ProtocolException: "Invalid redirect URI" bug....................................................done
recalculate the page stats & find a way to fix/circumvent saveAsTextFile error...................................................................done
appliancePartsPros always gets 83 fetch-from-DFS failures? Investigate!.................................done
sigmaAldrich has memory/disk overflow issue in shuffle spill even without lookup, why? is it due to persist or checkpoint?...............shuffle in union, and the wrong selector was used in SigmaAldrich!
add support for smartNoLookup optimizer..........................................................................................done
ISpark handling of special characters in generated md is weak, need to fix that by manually escaping or encoding them: use raw HTML element.toString for sanity test
unknown MIME type no longer causes NullPointerException but generates an empty page..................................................done
defaultParallelism now becomes configurable and is twice as large..................................................done
tests of data from ambiguous pages joined from identical links have been disabled, enable them again............................................done
enable joinKey & joinLimit in explore again...................................................................................................done
pagination & index tree in all integration tests are not deep enough, need a more stressful environment..............................tested on Abcam
backtrace of output by driverless should only be itself...........................................done
try Kryo serializer ....................................................................done
use a Nutch CrawlDB-style service to avoid the low efficiency of the lookup table? is it possible?................................too late to change!
the only possible way to do stability test is on extremely deep site (Abcam) and extremely wide site (Sigma-Aldrich)..............................done
install PhantomJS via Maven plugin to use Travis CI .........................................not necessary
when Expr returns an array, all actions should only use the first element of it .....................................done
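A one-line model of that rule, with Expr simplified to a plain function over a row:

```scala
// Actions consume only the head of an array-valued expression result.
def firstOnly[T](expr: Map[String, Any] => Seq[T])(row: Map[String, Any]): Option[T] =
  expr(row).headOption
```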
build a Hive Thrift server that can convert SPARQL into a query plan based on pre-defined prefixes and RDF templates
add test cases to make sure consecutive flatSelect & join won't use previous temporary values; these include (see the sketch after this list):
- flatSelect -> flatSelect
- join -> flatSelect
- explore -> flatSelect................................................................................................done
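A hedged model of the invariant these tests should pin down (the underscore prefix for temporary keys is an assumption for illustration, not the real convention):

```scala
object TempValueInvariant {
  type Row = Map[String, Any]

  // Temporary values must be dropped between consecutive stages.
  def dropTemporary(row: Row): Row = row.filterNot(_._1.startsWith("_"))

  def main(args: Array[String]): Unit = {
    val afterStage1: Row = Map("A" -> 1, "_tmp" -> "scratch")
    val afterStage2: Row = dropTemporary(afterStage1) + ("B" -> 2)
    assert(!afterStage2.contains("_tmp")) // flatSelect -> flatSelect must not leak
  }
}
```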
service pipeline: conform with ML-pipeline Transformations so it can be easily integrated and used
pipeline stage as a service: download pipeline stages from a public hosting service (e.g. GitHub) that are constantly maintained and versioned to keep up with the website.
define pipeline stage as SPARQL pragma in Hive Server
OAuthIntegration in wget ........................................................................................................done
full SparkSQL integration: since it has graduated from alpha, it is safe to use Row and DataFrame to allow more seamless integration.
automatically download PhantomJS from Internet....................................................................done
remember the derivation relationship of SpookyContext.metrics so metrics can be aggregated over their children. Eliminates the need to set SpookyConf.shareMetrics
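A minimal sketch of the idea (names hypothetical): each forked context remembers its children, so totals roll up without one shared accumulator set:

```scala
import scala.collection.mutable

class Metrics {
  var pagesFetched: Long = 0L
  private val children = mutable.Buffer.empty[Metrics]
  def fork(): Metrics = { val child = new Metrics; children += child; child }
  // Aggregate over the whole derivation tree.
  def totalPagesFetched: Long = pagesFetched + children.map(_.totalPagesFetched).sum
}
```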
if considering redirection, different traces may lead to identical URIs; explore may need to account for this to remove duplicates
allow clauses to take a normal function as an expression
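A minimal sketch of the lifting, with Expr simplified to a function over a row (an assumption, not the shipped trait):

```scala
import scala.language.implicitConversions

object ExprLifting {
  trait Expr[+T] { def apply(row: Map[String, Any]): Option[T] }

  // Any plain function can be used where an Expr is expected.
  implicit def fnToExpr[T](f: Map[String, Any] => T): Expr[T] =
    new Expr[T] { def apply(row: Map[String, Any]): Option[T] = Option(f(row)) }
}
```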
possible uncovered use case? root \ "" \ "selector" in which root \ "" results in Elements.
allow Snapshot & Wget to override contentType .......................................................................done
make declaredContentType only evidence instead of the final type, as it's prone to fail on undisciplined sites
add unstructured function to get formatted (indented) code .....................................................done
revise property/sysenv settings of SpookyConf +++ ............................................................
OAuth can't handle malformed URLs, should kick in after the URL is canonicalized ........................................done
wayback support: read history data from cache and the public Wayback database ..................only cache so far, needs tests
rewrite expression system using the Scala reflection API +