workshop should be done (#11)
thesteve0 authored Apr 18, 2019
1 parent 1b8fc60 commit f6d64be
Showing 5 changed files with 468 additions and 4 deletions.
92 changes: 89 additions & 3 deletions appdev-wkshp/04-full-text-search.md
# Full Text Search in PostgreSQL

While most of you probably know of the SQL search you can do on text with LIKE or ILIKE, did you know that PostgreSQL has
a full text analysis engine built right in? The search capabilities are similar to those available in Lucene and
its derivatives such as Elasticsearch and Solr. In this exercise we are going to show you how to set up and utilize
[Full Text Search](https://www.postgresql.org/docs/11/textsearch.html) (FTS).

#### Basic ideas in FTS

There is a [detailed discussion](https://www.postgresql.org/docs/11/textsearch-intro.html) in the documentation about the
concepts in FTS. For now let's just focus on the steps (and simplify the actual work).

1. The first step is to take the field(s) which hold the content (a document) and analyze the text into words and phrases along with
their positions in the original text.
    * One piece of this analysis is a tokenizer, which identifies words, numbers, URLs, and so on in the original text.
    * The other piece converts these tokens into "lexemes". A lexeme is a normalized word, and a dictionary is used to
map tokens to valid lexemes.
2. Then you store these lexemes and their positions either in a column or in an index.
3. Finally you use a search function that understands lexemes and positions in the original document to carry out the search.
This search function **must** use the same dictionary that was used to create the lexemes.
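The steps above can be seen directly in psql. A minimal sketch (the sample sentence is just an illustration, not workshop data):

```sql
-- Step 1: tokenize and normalize a document into lexemes with positions
SELECT to_tsvector('english', 'The quick brown foxes jumped over two lazy dogs');
-- returns something like: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2 'two':7

-- Step 3: a search function that speaks lexemes, using the same dictionary
SELECT to_tsvector('english', 'jumping') @@ to_tsquery('english', 'jump');  -- true
```

Notice that stop words like "the" and "over" are dropped, and "foxes" normalizes to the lexeme "fox".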

And with that **very** basic introduction let's get to it. We are going to do a FTS on the event narratives in the Storm
Events details table.

#### Build the index

Building a FTS index is actually quite simple:

```CREATE INDEX se_details_fulltext_idx ON se_details USING GIN (to_tsvector('english', event_narrative));```{{execute}}

This statement will take a little while to run as it tokenizes and normalizes all the content in the column.
The syntax is basically the same as creating any GIN index. The only difference is that we use the to_tsvector function,
passing in the dictionary to use, 'english', and the field to analyze.

If you want to see the other dictionaries PostgreSQL includes by default, just ask psql for them:

` \dF `{{execute}}

A quick note before we use that nice shiny index: there are actually two ways to store the results of the text analyzer,
either in an index or in a separate column.
While there is a full discussion of the tradeoffs in the [documentation](https://www.postgresql.org/docs/11/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX)
it boils down to:
1. With an index, when data is updated or inserted the index will automatically analyze it. On the downside, you, the
application developer, need to know the dictionary that was used and still use the analysis function in your query.
2. With a column, your SQL syntax is cleaner and your performance with large indices will be better. The downside is you
need to write a trigger to update the processed column anytime there is a change to the original document columns.
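The column approach in point 2 can be sketched with the built-in tsvector_update_trigger function (the column, trigger, and index names here are hypothetical, not part of the workshop):

```sql
-- Add a dedicated tsvector column and backfill it
ALTER TABLE se_details ADD COLUMN narrative_tsv tsvector;
UPDATE se_details SET narrative_tsv = to_tsvector('english', event_narrative);

-- Keep the column current on every insert or update
CREATE TRIGGER se_details_tsv_update
  BEFORE INSERT OR UPDATE ON se_details
  FOR EACH ROW
  EXECUTE PROCEDURE tsvector_update_trigger(narrative_tsv, 'pg_catalog.english', event_narrative);

-- Index the precomputed column instead of the expression
CREATE INDEX se_details_tsv_idx ON se_details USING GIN (narrative_tsv);
```

Queries can then use `where narrative_tsv @@ to_tsquery(...)` without re-analyzing the text.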

Today, for simplicity, we chose to just use the index approach. If you do end up using FTS we recommend you do some reading on the
solution that works best for you.

#### Using the index

If we want to do a full text search we can now do something like this:

```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('villag');```{{execute}}

You will notice in the result set that we are getting both village and villages; to_tsquery is a basic search parser that
matches on lexemes ('villag' is the lexeme both words normalize to). The query below also uses the `:*` operator, which
matches any word starting with the given prefix:

```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('english', 'villa:*');```{{execute}}

We can also now look for phrases such as words that appear close together in the document. Let's look for some big hail:

```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('grapefruit <1> hail');```{{execute}}

The `<1>` operator in this case tells the search to look for the words grapefruit and hail directly next to each other in the document.
As expected this returns no results. But if we now change the distance between the words to allow for an intervening word
we start to get what we expect:

```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('grapefruit <2> hail');```{{execute}}

The `<N>` operator is order sensitive: swapping grapefruit and hail, we again get no results:

```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('hail <2> grapefruit');```{{execute}}

You can also use the | (OR) and ! (NOT) operators inside to_tsquery(). Once you start writing more complicated search phrasings
you should use parentheses to group search terms together.

The following search will find all documents with grapefruit OR the prefix golf with the word hail two words later.

```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('(grapefruit | golf:* ) <2> hail');```{{execute}}

## Final notes on full text search

As you can see, we can do powerful and fast full text searching with FTS in PostgreSQL. Unfortunately, the documentation
on this feature is quite sparse and difficult to interpret. If you want to learn this syntax you are going to have to dig in,
debugging and trying different query techniques to get your desired results.
One other [helpful document](https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2116/slides/137/pgconf.eu-2018-fts.pdf) is a presentation by one of the lead developers.
Hopefully in the future the PostgreSQL community will update and improve this documentation.


77 changes: 77 additions & 0 deletions appdev-wkshp/05-key-value.md
# Storing and querying key-value pairs in a column

Sometimes the data you want to work with comes as key-value pairs where the keys are not all well defined beforehand. If you
are a Java programmer this sounds like a Map (such as HashMap), and if you are a Python programmer this sounds like a dictionary.
PostgreSQL allows you to handle this data with an extension named HSTORE. This extension allows for storing an arbitrary number
of key-value pairs in a column, along with operators to search the key-value pairs. The number of key-value pairs can also
vary between different rows.
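As a standalone sketch of what HSTORE looks like (the table and data here are hypothetical, not part of the workshop):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE requests (
    id serial PRIMARY KEY,
    headers hstore
);

-- Rows can carry different numbers of keys
INSERT INTO requests (headers) VALUES
    ('"Content-Type" => "text/html", "Server" => "nginx"'),
    ('"Content-Type" => "application/json"');

SELECT headers -> 'Server' FROM requests;  -- NULL for the row without that key
```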

## Two quick notes

While we are going to work with HSTORE today, you should realize that most of this functionality is superseded by the JSONB
data type and there is not much ongoing work on HSTORE. A column with arbitrary key-value pairs can also be modeled as a
flat JSON document, but if, for some reason, you don't want to use JSON, feel free to use HSTORE.

Second, while it is very convenient to dump arbitrary key-value pairs (or JSON for that matter) into a column, we believe this
pattern of handling data should only be used in limited cases. Using these data types for most of your data has several
potential drawbacks:

1. JSON and key-value data storage can be an order of magnitude larger, given that you are repeating the attributes or keys
for every row. And while disks are cheap, retrieving more data from disk will always be a performance penalty. Indexing more
data will always be more expensive as well.
1. You lose the ability of the database to "enforce" that the proper data type, such as integer or float, is being stored
in the database
1. You lose the ability to have the database keep track of and manage relations between different data. This can quickly lead
to data orphans or data being out of sync.


Make sure the benefits of using these data types outweigh the cost before using them.

Our recommended pattern is to process the JSON or Key-Value pairs and, as much as possible, store them in a well defined
schema. Again you are free to do what you want, but this is our recommendation.

#### Querying Key-Value data
We already inserted key-value data into the wikipedia table. We took the response headers from the web request and stored them
in the response_attr column.

Let's start by taking a quick look at the data in the hstore column:

```select response_attr from wikipedia limit 2;```{{execute}}

Let's get all the unique values for the key "Date":

```select DISTINCT response_attr -> 'Date' from wikipedia order by response_attr -> 'Date';```{{execute}}

Since we only have second resolution, there are fewer entries for unique dates (762) than there are for wikipedia pages
crawled (3221).

There is a [great table](https://www.postgresql.org/docs/11/hstore.html#id-1.11.7.25.5) showing all the operators and
functions you can use.
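A couple of those operators in action against our data (the hypothetical counts are omitted; run them to see):

```sql
-- ? checks whether a single key exists
select count(*) from wikipedia where response_attr ? 'Server';

-- ?& checks that all of the listed keys exist
select count(*) from wikipedia where response_attr ?& ARRAY['Date', 'Server'];
```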

Now let's have some more fun. How about getting all the keys back as a set:

```select skeys(response_attr) from wikipedia limit 30;```{{execute}}

or just the unique keys:

```select distinct skeys(response_attr) from wikipedia;```{{execute}}
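If you want the keys and values side by side, the each() function expands the pairs into a set of rows (an extra example beyond the workshop steps):

```sql
select h.key, h.value
from wikipedia, each(response_attr) as h
limit 10;
```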

How about deleting an existing key-value pair from my home county of Rockland:

```select response_attr as "responses" from wikipedia where county = 'Rockland County';```{{execute}}

```update wikipedia set response_attr = delete(response_attr,'Server') where county = 'Rockland County';```{{execute}}

```select response_attr as "responses" from wikipedia where county = 'Rockland County';```{{execute}}

Finally, as a segue into our next topic, let's return a key-value column as a JSONB document:

```select hstore_to_jsonb(response_attr) from wikipedia limit 1;```{{execute}}

As mentioned before, key-value columns are basically just a subset of JSONB functionality.

## Final notes

I hope you started to get a good sense of some of the possibilities of the HSTORE data type in PostgreSQL. For
more examples you can always look at [this document](http://www.postgresqltutorial.com/postgresql-hstore/).
Again, just one more time: we recommend that you use unstructured data types sparingly.
99 changes: 99 additions & 0 deletions appdev-wkshp/06-json-data.md
# Working with JSON(B) data in PostgreSQL

PostgreSQL has quite advanced JSON capabilities, especially with the addition of JSONB. The B in JSONB stands for binary, meaning
the document is actually stored in a binary format. This gives us a couple of advantages:

1. The document takes up less space as JSONB
2. We can index JSONB giving us all the benefits of database indices.
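The second point is worth a quick sketch. A GIN index over the whole document (the index name here is our own) can serve key-existence and containment operators:

```sql
create index wikipedia_json_content_idx on wikipedia using GIN (json_content);

-- e.g. the ? operator checks that a top-level key exists
select count(*) from wikipedia where json_content ? 'batchcomplete';
```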

Just a quick reminder, as I said in the previous scenario: please consider using JSON data storage sparingly, if at
all, since it gives you the benefit of convenience but at a high price.

With that, on to the JSONB.

## Querying JSONB

We stored the response from Wikipedia as JSONB, but it is not a very rich document structure. Even so, we can still do
some basic but interesting queries.

The first thing to understand is that there are two major operator types: ones that return JSON and others that return text.
For example, the `->` operator gets a JSON object field by key and returns JSON:

```sql
'{"a": "value"}'::json->'a'   -- yields the JSON value "value"
```

while the `->>` operator returns values as text:

```sql
'{"a": "value"}'::json->>'a'  -- yields the text value
```

This matters because when an object is returned as JSON you can pass it to another operator, thereby chaining operations.

For example our JSON has this structure (starting at the top)

```javascript

{
"batchcomplete" : true,
"warnings" :
{
"main":
{
"warnings":
"Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application."
}
,
"revisions":
{
"warnings":
"Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."
}
}
...
```
We can use this syntax to get the value of the revision warnings:

```select json_content -> 'warnings' -> 'revisions' ->> 'warnings' from wikipedia limit 1;```{{execute}}

Or we can use the JSON path operator:

```select json_content #>> '{warnings,revisions,warnings}' from wikipedia;```{{execute}}

#### A more advanced query
Let's pretend we don't have the state and county already in the table, but we want to query for Rockland County. If we look at
the JSONB we see there is a `normalized` field that contains county names:
```javascript
{
...
"query":
{
"normalized":
[{"fromencoded": false, "from": "Autauga_County,_Alabama", "to": "Autauga County, Alabama"}],
"pages":
[{
...

```
So we can use that field in our where clause. Because it is in a JSON array nested deep in our structure, we actually need to do
a subquery:
```
with normalized_to AS (
    select id, jsonb_array_elements(json_content #> '{query, normalized}') ->> 'to' as to_elements from wikipedia
)
select wikipedia.id, to_elements
from wikipedia, normalized_to
where normalized_to.to_elements ilike 'rockland%' AND normalized_to.id = wikipedia.id;
```{{execute}}
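An equivalent way to phrase the same unnesting is with a LATERAL join, which some find easier to read (same result, alternate form):

```sql
select w.id, n.elem ->> 'to' as to_elements
from wikipedia w,
     lateral jsonb_array_elements(w.json_content #> '{query, normalized}') as n(elem)
where n.elem ->> 'to' ilike 'rockland%';
```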
## Final Notes on Working with JSON in PostgreSQL
Now we have seen how you can query and select different parts of your document. We didn't even cover containment or other
fun operations. One other thing to keep in mind: you can also create
indexes directly on a field in a JSONB column, which is recommended if you are going to query that field quite a bit.
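For example, an expression index over one extracted text field might look like this (hypothetical index name, using the warnings path we queried earlier):

```sql
create index wikipedia_rev_warning_idx
    on wikipedia ((json_content #>> '{warnings,revisions,warnings}'));
```

A btree expression index like this serves equality and LIKE-prefix queries on that exact expression.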