
A general "refresh" to the Accumulo binding #947

Merged (1 commit, Apr 6, 2017)

Conversation

joshelser (Contributor):
  • Expand the README, covering table creation and best practices for table configuration
  • Avoid unnecessary Text object creation (in loops, and where byte[] can be used instead)
  • Use a ConcurrentHashMap to better match the DB API
  • Fix error messages and always call printStackTrace() on exceptions
  • Use BATCHED_OK instead of OK in insert() (more correct)

Tagging some potentially interested folks.. @busbey @madrob @keith-turner
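The "avoid unnecessary Text object creation" point is the usual allocation-hoisting pattern. A minimal sketch in plain Java, using `StringBuilder` as a hypothetical stand-in for Hadoop's mutable `Text` (the real binding would reuse a `Text` via `text.set(bytes)` rather than constructing one per loop iteration):

```java
import java.util.List;

public class ReuseBuffer {
  // Anti-pattern: allocates a fresh buffer on every iteration.
  static int allocatingLoop(List<String> keys) {
    int total = 0;
    for (String k : keys) {
      StringBuilder buf = new StringBuilder(k); // one allocation per element
      total += buf.length();
    }
    return total;
  }

  // Preferred: one mutable buffer, hoisted out of the loop and reset each
  // iteration, mirroring how a single Text instance can be reused.
  static int reusingLoop(List<String> keys) {
    int total = 0;
    StringBuilder buf = new StringBuilder();
    for (String k : keys) {
      buf.setLength(0); // reset, keeping the backing array
      buf.append(k);
      total += buf.length();
    }
    return total;
  }

  public static void main(String[] args) {
    List<String> keys = List.of("user1", "user22", "user333");
    System.out.println("loops agree: " + (allocatingLoop(keys) == reusingLoop(keys)));
  }
}
```

In a hot code path this trades N short-lived allocations for one reusable buffer, reducing GC pressure during a benchmark run.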

@phrocker left a comment:

Seems fine from an Accumulo perspective. A few minor comments that don't change my opinion.

// Couldn't spit out the mutations we wanted.
// Ignore this for now.
System.err.println("MutationsRejectedException: " + e.getMessage());
public BatchWriter getWriter(String t) throws TableNotFoundException {


why not just call it table? Vestigial?

joshelser (Contributor Author):

Ahh, good catch. This was originally because of the table member (checkstyle rule about a local variable masking a class member). Will fix.

if (null != numThreadsValue) {
numThreads = Integer.parseInt(numThreadsValue);
}
System.out.println("Using " + numThreads + " threads to write data");


Does it matter if this println remains? Given what the harness does I'm inclined not to really care if this is here.

Collaborator:

This should be on stderr if it's going to be manually printed. Otherwise it'll mess up folks using stdout to read results.

joshelser (Contributor Author):

Otherwise it'll mess up folks using stdout to read results.

Will do!

new ByteArrayByteIterator(buf));
}
} catch (Exception e) {
System.err.println("Error trying to reading Accumulo table" + key + e);
System.err.println("Error trying to reading Accumulo table " + t + " " + key);
e.printStackTrace();


^same?

Iterator<BatchWriter> iterator = writers.values().iterator();
while (iterator.hasNext()) {
BatchWriter writer = iterator.next();
writer.close();


Can't this throw MutationsRejectedException? Should we safeguard this and remove it from the writers anyway? Will it matter at this point in the process?

joshelser (Contributor Author):

At the cleanup point, there isn't much else we can do but throw an exception. The cleanup here is best-effort, but probably unnecessary (I don't have a use case for how YCSB would reasonably recover from this state).
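The best-effort cleanup being discussed could be sketched as follows. This is a hypothetical helper in plain Java; in the actual binding the writers are Accumulo BatchWriters and the failure would be a MutationsRejectedException:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class BestEffortClose {
  // Close every writer; one failure never stops the rest from being closed.
  // Collected failures can be reported (or rethrown) once at the end.
  static List<Exception> closeAll(Collection<? extends AutoCloseable> writers) {
    List<Exception> failures = new ArrayList<>();
    for (AutoCloseable w : writers) {
      try {
        w.close();
      } catch (Exception e) { // e.g. MutationsRejectedException in the real code
        failures.add(e);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    AutoCloseable ok = () -> { };
    AutoCloseable bad = () -> { throw new IllegalStateException("rejected"); };
    List<Exception> failures = closeAll(List.of(ok, bad, ok));
    System.out.println(failures.size() + " writer(s) failed to close");
  }
}
```

Whether collecting failures is worth it here is exactly the judgment call in the comment above: at shutdown there is little a caller can do with them beyond logging.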

// a concurrent data structure is overkill (especially in such a hot code path).
// However, the impact seems to be relatively negligible in trivial local tests and it's
// "more correct" WRT to the API.
BatchWriter writer = writers.get(t);
@keith-turner commented Mar 27, 2017:

Did you look into using MultiTableBatchWriter.getBatchWriter(String)? It does what you are doing here, but its performance may be worse.

I looked at the MTBW implementation, it may do more reads to main memory than what you have in this patch. The MTBW is trying to handle the case of a table being renamed, which causes more reads to main memory.

joshelser (Contributor Author):

I looked at the MTBW implementation, it may do more reads to main memory than what you have in this patch. The MTBW is trying to handle the case of a table being renamed, which causes more reads to main memory.

I did forget that MTBW was a thing. It would be nice if YCSB ever actually provided more than one table to write to; that would be worth exploring as a follow-on.
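The per-table writer cache under discussion (a ConcurrentHashMap keyed by table name, matching the DB API's concurrent use) might look like the following sketch. The `openWriter` method and the String stand-in for BatchWriter are hypothetical, and whether the PR uses `computeIfAbsent` exactly is not shown in the excerpt; it is simply the idiomatic atomic get-or-create:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class WriterCache {
  // Counts how many writers were actually opened, to show deduplication works.
  static final AtomicInteger opened = new AtomicInteger();

  // Stand-in for opening an Accumulo BatchWriter for a table.
  static String openWriter(String table) {
    opened.incrementAndGet();
    return "writer-for-" + table;
  }

  static final ConcurrentMap<String, String> writers = new ConcurrentHashMap<>();

  // Thread-safe get-or-create: computeIfAbsent opens the writer
  // at most once per table, even under concurrent callers.
  static String getWriter(String table) {
    return writers.computeIfAbsent(table, WriterCache::openWriter);
  }

  public static void main(String[] args) {
    getWriter("usertable");
    getWriter("usertable");
    getWriter("othertable");
    System.out.println("opened " + opened.get() + " writers");
  }
}
```

MultiTableBatchWriter.getBatchWriter(String) provides the same get-or-create semantics server-library-side, at the cost keith-turner describes: extra reads to handle table renames.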

@busbey (Collaborator) commented Mar 27, 2017:

Would switching to a logging api still be in scope for the amount of effort you're looking to do here?

@busbey (Collaborator) commented Mar 27, 2017:

Please start the commit message with "[accumulo]".

@joshelser (Contributor Author):

Would switching to a logging api still be in scope for the amount of effort you're looking to do here?

Could be. I wasn't sure if there was a reason a logger wasn't used in the first place :)

Please start the commit message with "[accumulo]".

Will do.

Thanks for the review, folks.

A Ruby script, based on the one in the HBase README, can generate adequate split points. Tens of Tablets per
TabletServer is a good starting point.

$ echo 'num_splits = 20; puts (1..num_splits).map {|i| "user#{1000+i*(9999-1000)/num_splits}"}' | ruby > /tmp/splits.txt
Collaborator:

Excellent!


On repeated data loads, the following commands may be helpful to quickly reset the state of the table.

accumulo> createtable tmp -cs usertable -cc usertable
Collaborator:

I don't remember if these flags have long versions that more clearly self-document setting things up. If they do, those would be preferred.

joshelser (Contributor Author):

Yup, I've switched the non-obvious ones over (everything but -t and -s).
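For reference, a plausible full reset sequence with the long options spelled out. The excerpt above shows only the first command, so the remaining commands and exact long-option names are an assumption (they vary by Accumulo version; check `help createtable` in the shell):

```
accumulo> createtable tmp --copy-splits usertable --copy-config usertable
accumulo> deletetable --force usertable
accumulo> renametable tmp usertable
```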

@@ -36,7 +36,38 @@ Git clone YCSB and compile:
cd YCSB
mvn -pl com.yahoo.ycsb:aerospike-binding -am clean package

### 3. Load Data and Run Tests
### 3. Create the Accumulo table
Collaborator:

Could you spell out what versions these commands work with?

joshelser (Contributor Author):

Looks like >=1.7 would work with what's in the README; table durability and the server-side mutation buffer are missing in earlier versions. I've tried to clarify this.

@busbey (Collaborator) commented Mar 27, 2017:

the travis failure is a checkstyle violation for line length, fyi.

@joshelser (Contributor Author):

the travis failure is a checkstyle violation for line length, fyi.

Ah, thanks. I was unintentionally ignoring that check :)

I'll fix that up and then squash/rebase these together

* Expand on the README, covering table creation and best-practices for table config
* Avoid unnecessary Text object creations (in loops and instead of byte[] usage)
* Use a ConcurrentHashMap to better match the DB API
* Fix error messages and always call printStackTrace() on exceptions
* Use BATCHED_OK instead of OK in insert() (more correct)
@joshelser (Contributor Author):

@busbey are you able to merge this for me? :)

@risdenk (Collaborator) commented Apr 6, 2017:

@joshelser looks like all comments were addressed. I can merge it.

PS - Guess we keep crossing internet paths today :P

@risdenk risdenk self-assigned this Apr 6, 2017
@risdenk risdenk added this to the 0.13.0 milestone Apr 6, 2017
@risdenk risdenk merged commit ddde8e3 into brianfrankcooper:master Apr 6, 2017
@joshelser (Contributor Author):

Thanks @risdenk!

PS - Guess we keep crossing internet paths today :P

I like it :)

tzm41 pushed a commit to tzm41/YCSB that referenced this pull request May 7, 2017
[accumulo] A general "refresh" to the Accumulo binding (brianfrankcooper#947)

* Expand on the README, covering table creation and best-practices for table config
* Avoid unnecessary Text object creations (in loops and instead of byte[] usage)
* Use a ConcurrentHashMap to better match the DB API
* Fix error messages and always call printStackTrace() on exceptions
* Use BATCHED_OK instead of OK in insert() (more correct)
jasontedor added a commit to jasontedor/YCSB that referenced this pull request Aug 7, 2017
* master:
  [core] Fixing squid:S1319 -  Declarations should use Java collection interfaces such as "List" rather than specific implementation classes such as "LinkedList". (manolama - updated bindings added since the PR)
  [core] Use longs instead of ints to support larger key spaces. Changed int to long in Measurements code to support large scale workloads. (manolama - fixed checkstyle errors)
  [core] Export totalHistogram for HdrHistogram measurement
  [core] Add an operation enum to the Workload class. This can eventually be used to replace the strings.
  [core] Add a Fisher-Yates array shuffle to the Utils class.
  [core] Fix an issue where the threadid and threadCount were not passed to the workload client threads. Had to use setters to get around the checkstyle complaint of having too many parameters.
  Upgrading googlebigtable to the latest version. The API used by googlebigtable has had quite a bit of churn.  This is the minimal set of changes required for the upgrade.
  [geode] Update to apache-geode 1.2.0 release
  [core] Update to use newer version of Google Cloud Spanner client and associated required change
  [core] Add a reset() method to the ByteIterator abstract and implementations for each of the children. This lets us re-use byte iterators if we need to access the values again (when applicable).
  [hbase12] Add HBase 1.2+ specific client that relies on the shaded client artifact provided by those versions. (brianfrankcooper#970)
  [distro] Refresh Apache licence text (brianfrankcooper#969)
  [memcached] support binary protocol (brianfrankcooper#965)
  [accumulo] A general "refresh" to the Accumulo binding (brianfrankcooper#947)
  [cloudspanner] Add binding for Google's Cloud Spanner. (brianfrankcooper#939)
  [aerospike] Change the write policy to REPLACE_ONLY (brianfrankcooper#937)
@manolama manolama mentioned this pull request Sep 22, 2017
@busbey busbey mentioned this pull request May 19, 2018