
A general "refresh" to the Accumulo binding #947

Merged (1 commit, Apr 6, 2017)

Conversation

joshelser (Contributor):
  • Expand the README, covering table creation and best practices for table configuration
  • Avoid unnecessary Text object creation (in loops, and where byte[] can be used instead)
  • Use a ConcurrentHashMap to better match the DB API
  • Fix error messages and always call printStackTrace() on exceptions
  • Use BATCHED_OK instead of OK in insert() (more correct)

Tagging some potentially interested folks.. @busbey @madrob @keith-turner
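The "avoid unnecessary Text object creation" point is the usual allocation-hoisting pattern. A minimal sketch in plain Java, using `StringBuilder` as a hypothetical stand-in for Hadoop's mutable `Text` (the real binding would reuse a `Text` via `text.set(bytes)` rather than constructing one per loop iteration):

```java
import java.util.List;

public class ReuseBuffer {
  // Anti-pattern: allocates a fresh buffer on every iteration.
  static int allocatingLoop(List<String> keys) {
    int total = 0;
    for (String k : keys) {
      StringBuilder buf = new StringBuilder(k); // one allocation per element
      total += buf.length();
    }
    return total;
  }

  // Preferred: one mutable buffer, hoisted out of the loop and reset each
  // iteration, mirroring how a single Text instance can be reused.
  static int reusingLoop(List<String> keys) {
    int total = 0;
    StringBuilder buf = new StringBuilder();
    for (String k : keys) {
      buf.setLength(0); // reset, keeping the backing array
      buf.append(k);
      total += buf.length();
    }
    return total;
  }

  public static void main(String[] args) {
    List<String> keys = List.of("user1", "user22", "user333");
    System.out.println("loops agree: " + (allocatingLoop(keys) == reusingLoop(keys)));
  }
}
```

In a hot code path this trades N short-lived allocations for one reusable buffer, reducing GC pressure during a benchmark run.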

@phrocker left a comment:

Seems fine from an Accumulo perspective. A few minor comments that don't change my opinion.

// Couldn't spit out the mutations we wanted.
// Ignore this for now.
System.err.println("MutationsRejectedException: " + e.getMessage());
public BatchWriter getWriter(String t) throws TableNotFoundException {


why not just call it table? Vestigial?

joshelser (Contributor Author):

Ahh, good catch. This was originally because of the table member (checkstyle rule about a local variable masking a class member). Will fix.

if (null != numThreadsValue) {
numThreads = Integer.parseInt(numThreadsValue);
}
System.out.println("Using " + numThreads + " threads to write data");


Does it matter if this println remains? Given what the harness does I'm inclined not to really care if this is here.

Collaborator:

This should be on stderr if it's going to be manually printed. Otherwise it'll mess up folks using stdout to read results.

joshelser (Contributor Author):

Otherwise it'll mess up folks using stdout to read results.

Will do!

new ByteArrayByteIterator(buf));
}
} catch (Exception e) {
System.err.println("Error trying to reading Accumulo table" + key + e);
System.err.println("Error trying to reading Accumulo table " + t + " " + key);
e.printStackTrace();


^same?

Iterator<BatchWriter> iterator = writers.values().iterator();
while (iterator.hasNext()) {
BatchWriter writer = iterator.next();
writer.close();


Can't this throw MutationsRejectedException? Should we safeguard this and remove it from the writers anyway? Will it matter at this point in the process?

joshelser (Contributor Author):

At the cleanup point, there isn't much else we can do but throw an exception. The cleanup here is best-effort, but probably unnecessary (I don't have a use case for how YCSB would reasonably recover from this state).
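The best-effort cleanup being discussed could be sketched as follows. This is a hypothetical helper in plain Java; in the actual binding the writers are Accumulo BatchWriters and the failure would be a MutationsRejectedException:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class BestEffortClose {
  // Close every writer; one failure never stops the rest from being closed.
  // Collected failures can be reported (or rethrown) once at the end.
  static List<Exception> closeAll(Collection<? extends AutoCloseable> writers) {
    List<Exception> failures = new ArrayList<>();
    for (AutoCloseable w : writers) {
      try {
        w.close();
      } catch (Exception e) { // e.g. MutationsRejectedException in the real code
        failures.add(e);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    AutoCloseable ok = () -> { };
    AutoCloseable bad = () -> { throw new IllegalStateException("rejected"); };
    List<Exception> failures = closeAll(List.of(ok, bad, ok));
    System.out.println(failures.size() + " writer(s) failed to close");
  }
}
```

Whether collecting failures is worth it here is exactly the judgment call in the comment above: at shutdown there is little a caller can do with them beyond logging.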

// a concurrent data structure is overkill (especially in such a hot code path).
// However, the impact seems to be relatively negligible in trivial local tests and it's
// "more correct" WRT to the API.
BatchWriter writer = writers.get(t);
@keith-turner commented Mar 27, 2017:

Did you look into using MultiTableBatchWriter.getBatchWriter(String)? It does what you are doing here, but its performance may be worse.

I looked at the MTBW implementation, it may do more reads to main memory than what you have in this patch. The MTBW is trying to handle the case of a table being renamed, which causes more reads to main memory.

joshelser (Contributor Author):

I looked at the MTBW implementation, it may do more reads to main memory than what you have in this patch. The MTBW is trying to handle the case of a table being renamed, which causes more reads to main memory.

I did forget that MTBW was a thing. It would be nice if YCSB ever actually provided more than one table to write to; that would be worth exploring as a follow-on.
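The per-table writer cache under discussion (a ConcurrentHashMap keyed by table name, matching the DB API's concurrent use) might look like the following sketch. The `openWriter` method and the String stand-in for BatchWriter are hypothetical, and whether the PR uses `computeIfAbsent` exactly is not shown in the excerpt; it is simply the idiomatic atomic get-or-create:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class WriterCache {
  // Counts how many writers were actually opened, to show deduplication works.
  static final AtomicInteger opened = new AtomicInteger();

  // Stand-in for opening an Accumulo BatchWriter for a table.
  static String openWriter(String table) {
    opened.incrementAndGet();
    return "writer-for-" + table;
  }

  static final ConcurrentMap<String, String> writers = new ConcurrentHashMap<>();

  // Thread-safe get-or-create: computeIfAbsent opens the writer
  // at most once per table, even under concurrent callers.
  static String getWriter(String table) {
    return writers.computeIfAbsent(table, WriterCache::openWriter);
  }

  public static void main(String[] args) {
    getWriter("usertable");
    getWriter("usertable");
    getWriter("othertable");
    System.out.println("opened " + opened.get() + " writers");
  }
}
```

MultiTableBatchWriter.getBatchWriter(String) provides the same get-or-create semantics server-library-side, at the cost keith-turner describes: extra reads to handle table renames.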

@busbey (Collaborator) commented Mar 27, 2017:

Would switching to a logging api still be in scope for the amount of effort you're looking to do here?

@busbey (Collaborator) commented Mar 27, 2017:

Please start the commit message with "[accumulo]".

@joshelser (Contributor Author):

Would switching to a logging api still be in scope for the amount of effort you're looking to do here?

Could be. I wasn't sure if there was a reason a logger wasn't used in the first place :)

Please start the commit message with "[accumulo]".

Will do.

Thanks for the review, folks.

A Ruby script, based on the one in the HBase README, can generate adequate split points. Tens of Tablets per
TabletServer is a good starting point.

$ echo 'num_splits = 20; puts (1..num_splits).map {|i| "user#{1000+i*(9999-1000)/num_splits}"}' | ruby > /tmp/splits.txt
Collaborator:

Excellent!


On repeated data loads, the following commands may be helpful to quickly reset the state of the table.

accumulo> createtable tmp -cs usertable -cc usertable
Collaborator:

I don't remember if these flags have long versions that more clearly self-document setting things up. If they do, those would be preferred.

joshelser (Contributor Author):

Yup, I've switched the non-obvious ones over (everything but -t and -s).
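For reference, a plausible full reset sequence with the long options spelled out. The excerpt above shows only the first command, so the remaining commands and exact long-option names are an assumption (they vary by Accumulo version; check `help createtable` in the shell):

```
accumulo> createtable tmp --copy-splits usertable --copy-config usertable
accumulo> deletetable --force usertable
accumulo> renametable tmp usertable
```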

@@ -36,7 +36,38 @@ Git clone YCSB and compile:
cd YCSB
mvn -pl com.yahoo.ycsb:aerospike-binding -am clean package

### 3. Load Data and Run Tests
### 3. Create the Accumulo table
Collaborator:

Could you spell out what versions these commands work with?

joshelser (Contributor Author):

Looks like >=1.7 would work with what's in the README; table durability and the server-side mutation buffer are missing in earlier versions. I've tried to clarify this.

@busbey (Collaborator) commented Mar 27, 2017:

the travis failure is a checkstyle violation for line length, fyi.

@joshelser (Contributor Author):

the travis failure is a checkstyle violation for line length, fyi.

Ah, thanks. I was unintentionally ignoring that check :)

I'll fix that up and then squash/rebase these together

* Expand on the README, covering table creation and best-practices for table config
* Avoid unnecessary Text object creations (in loops and instead of byte[] usage)
* Use a ConcurrentHashMap to better match the DB API
* Fix error messages and always call printStackTrace() on exceptions
* Use BATCHED_OK instead of OK in insert() (more correct)
@joshelser (Contributor Author):

@busbey are you able to merge this for me? :)

@risdenk (Collaborator) commented Apr 6, 2017:

@joshelser looks like all comments were addressed. I can merge it.

PS - Guess we keep crossing internet paths today :P

@risdenk risdenk self-assigned this Apr 6, 2017
@risdenk risdenk added this to the 0.13.0 milestone Apr 6, 2017
@risdenk risdenk merged commit ddde8e3 into brianfrankcooper:master Apr 6, 2017
@joshelser (Contributor Author):

Thanks @risdenk!

PS - Guess we keep crossing internet paths today :P

I like it :)

tzm41 pushed a commit to tzm41/YCSB that referenced this pull request May 7, 2017
[accumulo] A general "refresh" to the Accumulo binding (brianfrankcooper#947)

* Expand on the README, covering table creation and best-practices for table config
* Avoid unnecessary Text object creations (in loops and instead of byte[] usage)
* Use a ConcurrentHashMap to better match the DB API
* Fix error messages and always call printStackTrace() on exceptions
* Use BATCHED_OK instead of OK in insert() (more correct)
jasontedor added a commit to jasontedor/YCSB that referenced this pull request Aug 7, 2017
* master:
  [core] Fixing squid:S1319 -  Declarations should use Java collection interfaces such as "List" rather than specific implementation classes such as "LinkedList". (manolama - updated bindings added since the PR)
  [core] Use longs instead of ints to support larger key spaces. Changed int to long in Measurements code to support large scale workloads. (manolama - fixed checkstyle errors)
  [core] Export totalHistogram for HdrHistogram measurement
  [core] Add an operation enum to the Workload class. This can eventually be used to replace the strings.
  [core] Add a Fisher-Yates array shuffle to the Utils class.
  [core] Fix an issue where the threadid and threadCount were not passed to the workload client threads. Had to use setters to get around the checkstyle complaint of having too many parameters.
  Upgrading googlebigtable to the latest version. The API used by googlebigtable has had quite a bit of churn.  This is the minimal set of changes required for the upgrade.
  [geode] Update to apache-geode 1.2.0 release
  [core] Update to use newer version of Google Cloud Spanner client and associated required change
  [core] Add a reset() method to the ByteIterator abstract and implementations for each of the children. This lets us re-use byte iterators if we need to access the values again (when applicable).
  [hbase12] Add HBase 1.2+ specific client that relies on the shaded client artifact provided by those versions. (brianfrankcooper#970)
  [distro] Refresh Apache licence text (brianfrankcooper#969)
  [memcached] support binary protocol (brianfrankcooper#965)
  [accumulo] A general "refresh" to the Accumulo binding (brianfrankcooper#947)
  [cloudspanner] Add binding for Google's Cloud Spanner. (brianfrankcooper#939)
  [aerospike] Change the write policy to REPLACE_ONLY (brianfrankcooper#937)
@manolama manolama mentioned this pull request Sep 22, 2017
@busbey busbey mentioned this pull request May 19, 2018