Showing 5 changed files with 468 additions and 4 deletions.
@@ -1,5 +1,91 @@ | ||
# Full Text Search in PostgreSQL | ||
|
||
Build the index in the class and use ts_vector when building the index | ||
https://www.postgresql.org/docs/11/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX | ||
While most of you probably know the text search you can do in SQL with LIKE or ILIKE, did you know that PostgreSQL has
a full text analysis engine built right in? Its search capabilities are similar to those available in Lucene and
its derivatives such as Elasticsearch and Solr. In this exercise we are going to show you how to set up and use
[Full Text Search](https://www.postgresql.org/docs/11/textsearch.html) (FTS).
|
||
https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2116/slides/137/pgconf.eu-2018-fts.pdf | ||
#### Basic ideas in FTS | ||
|
||
There is a [detailed discussion](https://www.postgresql.org/docs/11/textsearch-intro.html) in the documentation about the
concepts in FTS. For now let's just focus on the steps (and simplify the actual details).
|
||
1. The first step is to take the field(s) that hold the content (a document) and analyze the text into words and phrases, along with
their positions in the original text.
    * One piece of this analysis is the tokenizer (PostgreSQL calls it a parser), which identifies words, numbers, URLs, and so on in the original text.
    * The other piece converts these tokens into "lexemes". A lexeme is a normalized word, and you need a dictionary to
    turn tokens into valid lexemes.
2. Then you store these lexemes and their positions either in a column or in an index.
3. Now you use a search function that understands lexemes and positions in the original document to carry out the search.
This search function **must** use the same dictionary (in PostgreSQL terms, the same text search configuration) that was used to create the lexemes.
|
||
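
The three steps above can be tried on a literal string before touching any table. This is a minimal sketch: the sample sentence is made up for illustration, while `ts_debug`, `to_tsvector`, and `to_tsquery` are standard PostgreSQL functions.

```sql
-- Step 1: see how the parser splits the text into tokens and which
-- lexemes the 'english' configuration produces for each token.
SELECT alias, token, lexemes
FROM ts_debug('english', 'The villages were flooded after heavy rains');

-- Step 2: the form that gets stored is a tsvector of lexemes with positions.
SELECT to_tsvector('english', 'The villages were flooded after heavy rains');

-- Step 3: search with a tsquery built with the same configuration.
SELECT to_tsvector('english', 'The villages were flooded after heavy rains')
       @@ to_tsquery('english', 'village & flood') AS matches;
```

The last statement should return true, because the document and the query are both reduced to the same lexemes.
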
And with that **very** basic introduction let's get to it. We are going to do a FTS on the event narratives in the Storm | ||
Events details table. | ||
|
||
#### Build the index | ||
|
||
Building a FTS index is actually quite simple: | ||
|
||
```CREATE INDEX se_details_fulltext_idx ON se_details USING GIN (to_tsvector('english', event_narrative));```{{execute}} | ||
|
||
This statement will take a little while to run as it tokenizes all the content in the column and converts it to lexemes.
The syntax is basically the same as creating any GIN index. The only difference is that we use the to_tsvector function,
passing in the text search configuration to use, 'english', and the field to analyze.
|
||
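
One way to check that a query can use this index is to look at its plan. The sketch below assumes the index from the statement above exists; the search word `hail` is just an example.

```sql
-- The WHERE clause repeats the indexed expression exactly, including the
-- 'english' configuration, so the planner is able to consider the index.
EXPLAIN
SELECT begin_date_time, event_narrative
FROM se_details
WHERE to_tsvector('english', event_narrative) @@ to_tsquery('english', 'hail');
```

If the plan mentions `se_details_fulltext_idx`, the index is being used. The planner may still prefer a sequential scan on a small table, and a query that calls `to_tsvector` without the configuration argument does not match this expression index.
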
If you want to see the text search configurations PostgreSQL includes by default, just ask psql for them:
|
||
` \dF `{{execute}} | ||
|
||
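
Since `\dF` is a psql command, it is not available from other SQL clients. A rough equivalent, sketched here with the standard system catalog `pg_ts_config`, is:

```sql
-- Same list of configurations as \dF, but usable from any SQL client.
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;

-- The configuration used when none is passed to to_tsvector or to_tsquery.
SHOW default_text_search_config;
```
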
A quick note before we use that nice shiny index: there are actually two ways to store the results of the text analyzer,
either in an index or in a separate column.
While there is a full discussion of the tradeoffs in the [documentation](https://www.postgresql.org/docs/11/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX),
it boils down to:
1. With an index, when data is updated or inserted the index will automatically analyze it. On the downside, you, the
application developer, need to know the text search configuration that was used and still use the analysis function in your query.
2. With a column, your SQL syntax is cleaner and your performance with large indices will be better. The downside is that you
need to write a trigger to update the processed column any time there is a change to the original document columns.
|
||
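
For comparison, the column approach could look roughly like the sketch below. The column, index, and trigger names are hypothetical, and it assumes `event_narrative` is a `text` column; `tsvector_update_trigger` is the trigger function that ships with PostgreSQL.

```sql
-- Hypothetical column that holds the pre-computed lexemes.
ALTER TABLE se_details ADD COLUMN narrative_tsv tsvector;

-- Fill it once for the rows that already exist.
UPDATE se_details
SET narrative_tsv = to_tsvector('english', coalesce(event_narrative, ''));

-- Index the column instead of the expression.
CREATE INDEX se_details_narrative_tsv_idx ON se_details USING GIN (narrative_tsv);

-- Keep the column current on every insert or update.
CREATE TRIGGER se_details_narrative_tsv_update
BEFORE INSERT OR UPDATE ON se_details
FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(narrative_tsv, 'pg_catalog.english', event_narrative);

-- Queries no longer need to repeat to_tsvector.
SELECT begin_date_time, event_narrative
FROM se_details
WHERE narrative_tsv @@ to_tsquery('english', 'hail');
```
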
Today, for simplicity, we chose to just use the index approach. If you do end up using FTS, we recommend you do some reading on the
solution that works best for you.
|
||
#### Using the index | ||
|
||
If we want to do a full text search we can now do something like this: | ||
|
||
```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('villag');```{{execute}} | ||
|
||
You will notice in the result set that we are getting both village and villages. to_tsquery is a basic search parser.
It also allows us to use the _:*_ suffix for prefix matching, so the search matches any word whose lexeme starts with the given prefix:
|
||
```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('english', 'villa:*');```{{execute}} | ||
|
||
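
To see what the search parser does with its input, it can help to run `to_tsquery` on its own. This is a small sketch using made-up sample text:

```sql
-- The first query word is reduced to a lexeme; the second keeps its
-- prefix marker.
SELECT to_tsquery('english', 'villages') AS stemmed,
       to_tsquery('english', 'villa:*')  AS prefix;

-- A prefix query matches any lexeme that starts with the prefix.
SELECT to_tsvector('english', 'A villa near the village')
       @@ to_tsquery('english', 'villa:*') AS matches;
```

The second statement should return true.
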
We can also now look for phrases, that is, words that appear close together in the document. Let's look for some big hail:
|
||
```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('grapefruit <1> hail');```{{execute}} | ||
|
||
The *<1>* operator in this case tells the search to look for the words grapefruit and hail directly next to each other in the document.
As expected this returns no results. But if we now change the distance to 2, which requires exactly one intervening word
(as in "grapefruit sized hail"), we start to get what we expect:
|
||
```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('grapefruit <2> hail');```{{execute}} | ||
|
||
The <N> operator is order sensitive. Swapping grapefruit and hail, we again get no results:
|
||
```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('hail <2> grapefruit');```{{execute}} | ||
|
||
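
The behavior of the distance operator can be checked on a literal phrase. This is a sketch with a made-up sample string, not data from the table:

```sql
-- 'grapefruit' is at position 1, 'sized' at 2 and 'hail' at 3.
SELECT to_tsvector('english', 'grapefruit sized hail') AS lexemes,
       to_tsvector('english', 'grapefruit sized hail')
           @@ to_tsquery('english', 'grapefruit <1> hail') AS adjacent,
       to_tsvector('english', 'grapefruit sized hail')
           @@ to_tsquery('english', 'grapefruit <2> hail') AS one_word_between,
       to_tsvector('english', 'grapefruit sized hail')
           @@ to_tsquery('english', 'hail <2> grapefruit') AS reversed;
```

Only `one_word_between` should be true: the positions of the two lexemes must differ by exactly N, in the order written.
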
You can also use the | (OR) and ! (NOT) operators inside to_tsquery(). Once you start writing more complicated search expressions
you should use parentheses to group search terms together.
|
||
The following search will find all documents that contain grapefruit OR a word starting with the prefix golf, followed by the word hail two positions later:
|
||
```select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('(grapefruit | golf:* ) <2> hail');``` | ||
|
||
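
The NOT operator mentioned above works the same way. As a sketch, with search words chosen only as an example:

```sql
-- Narratives that mention hail but not the word tornado.
SELECT begin_date_time, event_narrative
FROM se_details
WHERE to_tsvector('english', event_narrative)
      @@ to_tsquery('english', 'hail & !tornado')
LIMIT 10;
```
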
## Final notes on full text search | ||
|
||
As you can see, we can do powerful and fast full text searching with FTS in PostgreSQL. Unfortunately, the documentation | ||
on this feature is actually quite sparse and difficult to interpret. If you want to learn this syntax you are going to have to dig in with debugging | ||
and trying different query techniques to get your desired results. | ||
One other [helpful document](https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2116/slides/137/pgconf.eu-2018-fts.pdf) is a presentation by one of the lead developers. | ||
Hopefully in the future the PostgreSQL community will update and improve this documentation. | ||
|
||
|
@@ -0,0 +1,77 @@ | ||
# Storing and querying key-value pairs in a column | ||
|
||
Sometimes the data you want to work with comes as key-value pairs and not all the keys are known beforehand. If you
are a Java programmer this sounds like a Map (such as HashMap) and if you are a Python programmer this sounds like a dictionary.
PostgreSQL allows you to handle this data with an extension named HSTORE. This extension allows for storing an arbitrary number
of key-value pairs in a column, along with operators to search the key-value pairs. The number of key-value pairs can also
vary between different rows.
|
||
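
As a minimal sketch of what an HSTORE value looks like, the statements below use made-up header values rather than data from the table:

```sql
-- The extension has to be enabled once per database.
CREATE EXTENSION IF NOT EXISTS hstore;

-- An hstore value is a set of text keys mapped to text values.
SELECT 'Server => nginx, "Content-Type" => "text/html"'::hstore AS headers;

-- The -> operator looks up the value for one key.
SELECT ('Server => nginx, "Content-Type" => "text/html"'::hstore) -> 'Server' AS server;
```

Both keys and values are stored as text, which is one reason the database cannot enforce data types on them.
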
## Two quick notes | ||
|
||
While we are going to work with HSTORE today, you should realize that most of this functionality is superseded by the JSONB
data type and there is not much ongoing work on HSTORE. A column with arbitrary key-value pairs can also be modeled as a
flat JSON document but if, for some reason, you don't want to use JSON, feel free to use HSTORE.
|
||
Second, while it is very convenient to dump arbitrary key-value pairs (or JSON for that matter) into a column, we believe this
pattern of handling data should only be used in limited cases. Using these data types for most of your data has several
potential drawbacks:

1. JSON and key-value data can take up much more storage, potentially an order of magnitude more, given that you are repeating the attributes or keys
for every row. And while disks are cheap, retrieving more data from disk always carries a performance penalty. Indexing more
data is more expensive as well.
1. You lose the ability of the database to "enforce" that the proper data type, such as integer or float, is being stored
in the database.
1. You lose the ability to have the database keep track of and manage relations between different data. This can quickly lead
to orphaned data or data being out of sync.
|
||
|
||
Make sure the benefits of using these data types outweigh the costs before using them.
|
||
Our recommended pattern is to process the JSON or Key-Value pairs and, as much as possible, store them in a well defined | ||
schema. Again you are free to do what you want, but this is our recommendation. | ||
|
||
#### Querying Key-Value data | ||
We already inserted key-value data into the wikipedia table. We took the response headers from the web request and stored them | ||
in the response_attr column. | ||
|
||
Let's start by taking a quick look at the data in the hstore column: | ||
|
||
```select response_attr from wikipedia limit 2;```{{execute}} | ||
|
||
Let's get all the unique values for the key "Date": | ||
|
||
```select DISTINCT response_attr -> 'Date' from wikipedia order by response_attr -> 'Date';```{{execute}} | ||
|
||
Since the Date header only has one-second resolution, there are fewer unique dates (762) than there are wikipedia pages (3221)
crawled.
|
||
There is a [great table](https://www.postgresql.org/docs/11/hstore.html#id-1.11.7.25.5) showing all the operators and | ||
functions you can use. | ||
|
||
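
Two operators from that table are handy for filtering rows. The sketch below assumes the `response_attr` and `county` columns used elsewhere in this scenario; the header value in the second query is a made-up example.

```sql
-- ? tests whether a key is present.
SELECT count(*) FROM wikipedia WHERE response_attr ? 'Server';

-- @> tests whether the left side contains all pairs on the right side.
SELECT county
FROM wikipedia
WHERE response_attr @> '"Content-Type" => "application/json"'::hstore;
```
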
Now let's have some more fun. How about getting all the keys back as a set:
|
||
```select skeys(response_attr) from wikipedia limit 30;```{{execute}} | ||
|
||
or just the unique keys: | ||
|
||
```select distinct skeys(response_attr) from wikipedia;```{{execute}} | ||
|
||
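
Building on the set-returning functions, the following sketch expands key-value pairs into rows and counts how often each key appears. The literal values and the alias names `t` and `k` are only illustrative.

```sql
-- each() expands one hstore value into (key, value) rows.
SELECT key, value
FROM each('Server => nginx, Date => today'::hstore);

-- Count how many rows carry each key.
SELECT t.k AS key_name, count(*) AS row_count
FROM wikipedia, skeys(response_attr) AS t(k)
GROUP BY t.k
ORDER BY row_count DESC;
```
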
How about deleting an existing key-value pair from the row for my home county of Rockland:
|
||
```select response_attr as "responses" from wikipedia where county = 'Rockland County';```{{execute}} | ||
|
||
```update wikipedia set response_attr = delete(response_attr,'Server') where county = 'Rockland County';```{{execute}} | ||
|
||
```select response_attr as "responses" from wikipedia where county = 'Rockland County';```{{execute}} | ||
|
||
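
The same kind of change can be tried on literals first. This sketch uses made-up values and shows the operator forms for removing and merging pairs:

```sql
-- The - operator removes a key, like delete(); the cast to text selects
-- the "remove one key" variant of the operator.
SELECT 'Server => nginx, Date => today'::hstore - 'Server'::text AS without_server;

-- The || operator merges two hstore values; pairs on the right win.
SELECT 'Server => nginx'::hstore || 'Server => apache, Age => 0'::hstore AS merged;
```
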
Finally, as a segue into our next topic, let's return a key-value row as a JSONB document:
|
||
```select hstore_to_jsonb(response_attr) from wikipedia limit 1;``` | ||
|
||
As mentioned before, key-value columns are basically just a subset of JSONB functionality. | ||
|
||
## Final notes | ||
|
||
I hope you have started to get a good sense of some of the possibilities of the HSTORE data type in PostgreSQL. For
more examples you can always look to [this document](http://www.postgresqltutorial.com/postgresql-hstore/).
Again, just one more time, we recommend that you use unstructured data types sparingly.
@@ -0,0 +1,99 @@ | ||
# Working with JSON(B) data in PostgreSQL
|
||
PostgreSQL has quite advanced JSON capabilities, especially with the addition of JSONB. The B in JSONB stands for binary, meaning
the document is actually stored in a decomposed binary format. This gives us a couple of advantages:

1. The document is faster to process as JSONB because it does not have to be reparsed on every access (it is not necessarily smaller on disk than plain JSON).
2. We can index JSONB, giving us all the benefits of database indices.
|
||
Just a quick reminder, as I said in the previous scenario: please consider using JSON data storage sparingly, if at
all, since it gives you the benefit of convenience but at a high price.
|
||
With that, on to JSONB.
|
||
## Querying JSONB | ||
|
||
We stored the response from Wikipedia as JSONB, but it is not a very rich document structure. Even so, we can still do
some basic but interesting queries.
|
||
The first thing to understand is that there are two major operator types: ones that return JSON and others that return text.
For example, the `->` operator gets a JSON object field by key and returns JSON:
|
||
```sql
select '{"a": "value"}'::json -> 'a';  -- returns "value" as JSON, quotes included
```
|
||
while the `->>` operator returns values as text:
|
||
```sql
select '{"a": "value"}'::json ->> 'a';  -- returns value as text, without quotes
```
|
||
This matters because when an object is returned as JSON you can pass it to another operator, thereby chaining operations.
|
||
For example, our JSON has this structure (starting at the top):

```javascript
{
    "batchcomplete" : true,
    "warnings" :
    {
        "main":
        {
            "warnings":
                "Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application."
        },
        "revisions":
        {
            "warnings":
                "Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."
        }
    }
...
```

We can use this syntax to get the value of the revision warnings:

```select json_content -> 'warnings' -> 'revisions' ->> 'warnings' from wikipedia limit 1;```{{execute}}

Or we can use the JSON path operator:

```select json_content #>> '{warnings,revisions,warnings}' from wikipedia;```{{execute}}

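
The difference between the two operator families can be checked on a small literal document. This is a sketch: the sample document is a made-up, shortened version of the structure above.

```sql
-- -> keeps the result as jsonb so it can be chained; ->> and #>> end the
-- chain by returning text.
SELECT doc -> 'warnings' -> 'revisions' ->> 'warnings'            AS chained,
       doc #>> '{warnings,revisions,warnings}'                     AS by_path,
       pg_typeof(doc -> 'warnings')                                AS arrow_type,
       pg_typeof(doc -> 'warnings' -> 'revisions' ->> 'warnings')  AS double_arrow_type
FROM (
    SELECT '{"warnings": {"revisions": {"warnings": "legacy format"}}}'::jsonb AS doc
) AS sample;
```

The first two columns should hold the same text, and the type columns should report `jsonb` and `text`.
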
#### A more advanced query | ||
Let's pretend we don't have the state and county already in the table but we want to query for Rockland County. If we look at
the JSONB we see there is a `normalized` field that contains county names:
```javascript
{
...
    "query":
    {
        "normalized":
            [{"fromencoded": false, "from": "Autauga_County,_Alabama", "to": "Autauga County, Alabama"}],
        "pages":
            [{
...
```
So we can use that field in our where clause. Because it is in a JSON array nested deep in our structure, we need to first
expand the array in a common table expression (which works like a subquery):
```
with normalized_to as (
    select id, jsonb_array_elements(json_content #> '{query, normalized}') ->> 'to' as to_elements
    from wikipedia
)
select wikipedia.id, to_elements
from wikipedia
join normalized_to on normalized_to.id = wikipedia.id
where normalized_to.to_elements ilike 'rockland%';
```
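
To see what `jsonb_array_elements` does on its own, it can be run against a literal array. The array below is a made-up sample shaped like the `normalized` field, and the alias names `t` and `elem` are only illustrative.

```sql
-- jsonb_array_elements turns one array into one row per element.
SELECT t.elem ->> 'to' AS to_element
FROM jsonb_array_elements(
         '[{"from": "Rockland_County,_New_York", "to": "Rockland County, New York"}]'::jsonb
     ) AS t(elem);
```
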
## Final Notes on Working with JSON in PostgreSQL | ||

Now we have seen how you can query and select different parts of your document. We didn't even cover containment or other
fun operations. One other thing to keep in mind: you can also create
indexes directly on a field inside a JSONB column, which is recommended if you are going to query that field quite a bit.
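
As a brief sketch of containment and indexing, the statements below assume `json_content` is a JSONB column. The index names are hypothetical.

```sql
-- Containment: does the document on the left contain the one on the right?
SELECT '{"batchcomplete": true, "query": {"pages": []}}'::jsonb
       @> '{"batchcomplete": true}'::jsonb AS contains;

-- A GIN index over the whole column supports containment searches.
CREATE INDEX wikipedia_json_content_idx ON wikipedia USING GIN (json_content);

-- An expression index on a single field inside the document.
CREATE INDEX wikipedia_batchcomplete_idx
    ON wikipedia ((json_content ->> 'batchcomplete'));

-- A containment query that the GIN index can serve.
SELECT id FROM wikipedia WHERE json_content @> '{"batchcomplete": true}';
```

The first statement should return true. Which index fits best depends on whether you search whole documents or one specific field.
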