"dot or colon?"
Some late night under deadline, Pig will supply you with the absolutely baffling error message "scalar has more than one row in the output". You've gotten confused and used the tuple element operator (players.year) when you should have used the disambiguation operator (players::year). The dot is used to reference a tuple element, a common task following a GROUP. The double-colon is used to clarify which specific field is intended, common following a join of tables that share a field name.
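To make the distinction concrete, here is a minimal sketch; it assumes a players relation with team_id and year fields and a teams relation that also carries a year field, with the names chosen only for illustration.

    -- After a GROUP, the dot reaches inside the grouped bag:
    by_team   = GROUP players BY team_id;
    year_bags = FOREACH by_team GENERATE group, players.year;  -- one bag of years per team

    -- After a JOIN of tables that share a field name, the double-colon
    -- says whose field you mean:
    pl_tm = JOIN players BY team_id, teams BY team_id;
    years = FOREACH pl_tm GENERATE players::year, teams::year;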
Where to look to see whether Pig is telling you that you have nulls, bad fields, numbers larger than your type will hold, or a misaligned schema.
Things that used to be gotchas but aren't any more, preserved here only until the tech review:
- You can rename an alias and refer to it by the new name: B = A; works. (v10)
- LIMIT is handled in the loader, and LIMIT accepts an expression. (v10)
- There is an OTHERWISE (else) branch on SPLIT (see the sketch after this list). (v10)
- If you kill a Pig job using Ctrl-C or "kill", Pig will now kill all associated Hadoop jobs currently running. This applies to both grunt mode and non-interactive mode.
- In the next Pig (post-0.12.0):
  - CONCAT will accept multiple arguments.
  - STORE can overwrite an existing directory (PIG-259).
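As a sketch of the OTHERWISE branch mentioned above (the players relation and its year field are only illustrative):

    -- Records matching no earlier condition land in the OTHERWISE alias:
    SPLIT players INTO moderns IF year >= 1961, olden_times OTHERWISE;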
- "Good Habits of SQL Users That Will Mess You Up in Hadoop"
  - Group/Cogroup is king; Join is a special case (see the sketch after this list).
  - Window functions are a recent feature; use but don't overuse Stitch/Over.
  - Everything is immutable, so you don't need, and can't have, transactional behavior.
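Here is a rough sketch of the "Join is a special case" point, again with illustrative players and teams relations keyed by team_id:

    -- An inner JOIN ...
    jnd = JOIN players BY team_id, teams BY team_id;

    -- ... returns the same records as a COGROUP once you flatten both bags;
    -- groups with an empty bag on either side simply produce no output.
    cgd  = COGROUP players BY team_id, teams BY team_id;
    jnd2 = FOREACH cgd GENERATE FLATTEN(players), FLATTEN(teams);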
TODO: fill this in with more gotchas
- A Foolish Optimization
TODO: Make this more generally "don't use the O(N) algorithm that works locally"; Fisher-Yates and top-k-via-heap being two examples.
TODO: Consider pushing this up, earlier in the chapter, if we find a good spot for it.
We will tell you about another "optimization," mostly because we want to illustrate how a naive performance estimate based on theory can lead you astray in practice. In principle, sorting a large table takes 'O(N log N)' time. On a single compute node, you can find the top K elements in only 'O(N log K)' time, a big savings since K is much smaller than N: maintain a heap of the K largest elements seen so far, and for every later element, if it is larger than the smallest element in the heap, remove that smallest member and add the new element to the heap. While it is true that 'O(N log K)' beats 'O(N log N)', this reasoning is flawed in two ways. First, you are not working in a single-node context; Hadoop is going to perform its sort anyway. Second, the fixed costs of I/O almost always dominate the cost of compute. (FOOTNOTE: Unless you are unjustifiably fiddling with a heap in your Mapper.)
The 'O(log N)' portion of Hadoop's sort shows up in two ways. The in-memory sort that precedes a spill is 'O(N log N)' in compute time, but that is cheaper than the cost of spilling the data to disk. The true 'O(N log N)' cost comes in the reducer: 'O(log N)' merge passes, each of cost 'O(N)'. [1] But K is small, so there should not be multiple merge passes; the actual runtime is 'O(N)' in disk bandwidth. Avoid subtle before-the-fact reasoning about performance: run your job, count the number of merge passes, weigh your salary against the cost of the computers you are running on, and only then decide whether it is worth optimizing.
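In Pig, the un-foolish version is simply to sort and take a LIMIT, letting the framework do what it was going to do anyway; the relation, the hits field, and the choice of 20 are only placeholders:

    -- Hadoop performs the full sort regardless; the LIMIT keeps the K records you asked for.
    sorted = ORDER players BY hits DESC;
    top_k  = LIMIT sorted 20;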
- Hadoop command-line tools use the trash when deleting items; programs using the Hadoop API do not. <remark>verify Pig uses the trash</remark>
- The tasktracker heap size does not affect the amount of memory available to your tasks. You must adjust mapred.map.child.java.opts or mapred.reduce.child.java.opts instead (see the sketch after this list).
- Stripe your data across all disks
- Don't use the root drive
- Don't RAID your disks
- Don't hit external services; see [server_logs_ddos].
- Remove files from the trash immediately with hadoop fs -rm -r -skipTrash /user/(you)/.Trash
- Input format for ZIP files: ZIP file input format
- Calculate file checksums: IvoryMonkey by Josh Patterson (@jpatanooga) will calculate a checksum locally that you can compare with the Hadoop filesystem's internal checksum. I'm not aware of a way to calculate the MD5 or SHA-1 checksum of a file on the HDFS short of doing a map-side job with a non-splittable input format. (TODO: update command for wukong 3.0 format)
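Here is a sketch of bumping the child JVM heap from inside a Pig script; the property names are the ones given above, and the heap sizes are placeholders to size for your own cluster:

    -- Passed straight through to the Hadoop job configuration:
    SET mapred.map.child.java.opts    '-Xmx1024m';
    SET mapred.reduce.child.java.opts '-Xmx2048m';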
[1] More precisely, about 'N (log_B(N/M) - 1)' records are read and written during the merge passes, where B is the merge factor and M is the number of records that fit in the in-memory sort buffer. [TODO: double check this]