| copyright | lastupdated |
|---|---|
| | 2017-12-19 |
{:shortdesc: .shortdesc} {:new_window: target="_blank"} {:tip: .tip} {:pre: .pre} {:codeblock: .codeblock} {:screen: .screen} {:javascript: .ph data-hd-programlang='javascript'} {:java: .ph data-hd-programlang='java'} {:python: .ph data-hd-programlang='python'} {:swift: .ph data-hd-programlang='swift'}
This documentation is for {{site.data.keyword.knowledgestudiofull}} on {{site.data.keyword.cloud}}. To see the documentation for the previous version of {{site.data.keyword.knowledgestudioshort}} on {{site.data.keyword.IBM_notm}} Marketplace, click this link {: new_window}. {: tip}
{: #wks_tutrule_intro}
This tutorial helps you understand how to create a rule-based model that you can use to find text patterns that you define in documents. {: shortdesc}
You will build a model that can find text in documents that matches the pattern `month day, year`. For example, the model would find the date reference May 1, 2010. Before you define the rule pattern itself, you will create artifacts that will help you build the pattern, including a dictionary class that recognizes month mentions and a regular expression class that recognizes year mentions in text.
After you complete this tutorial, you will know how to perform the following tasks:
- Create classes
- Add documents for defining rules
- Associate dictionaries with classes
- Define regular expressions to capture sequences of characters
- Define rules
This tutorial should take approximately 30 minutes to finish. If you explore other concepts related to this tutorial, it could take longer to complete.
- You're using a supported browser. For information, see Browser requirements.
- You successfully completed Tutorial: Creating a workspace.
- You must have at least one user ID in either the Admin or ProjectManager role. For information about user roles, see Assembling a team.
After you create the rule-based model, you can use it in one of the following ways to find text patterns in documents:
- Pre-annotate your documents before you create a machine learning model
- Deploy or export the model to other {{site.data.keyword.watson}} services or products
{: #wks_tutless_rule1}
In this lesson, you will learn how to add a dictionary to a workspace in {{site.data.keyword.knowledgestudioshort}}. The dictionary contains terms related to the months of the year.
In a later lesson, you will define a class based on this dictionary. When you create that class, all terms in this dictionary that are found in documents will be automatically annotated as a mention of the associated class type. For more information about dictionaries, see Adding dictionaries to a workspace.
- Download the `dictionary-items-month.csv` file to your computer. This file contains dictionary terms in CSV format, suitable for importing into a {{site.data.keyword.knowledgestudioshort}} dictionary.
- From the Assets & Tools > Pre-Annotators sidebar, select the Dictionaries tab, and click Manage Dictionaries.
- Click the Create Dictionary button to add a dictionary.
- In the Name field, type `Month dictionary` and click Save to create the (empty) dictionary. The new dictionary is created and automatically opened for editing.
- In the dictionary pane, click Upload.
- In the Upload Dictionary Entries window, select the `dictionary-items-month.csv` file from your computer and then click Upload. The terms in the file are imported into the dictionary.
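
To make the idea of a dictionary file concrete, here is a minimal Python sketch of how a month dictionary in CSV form could be read. The sample rows, the `lemma,poscode,surface` column layout, and the `load_dictionary` helper are assumptions for illustration only; this tutorial does not show the actual contents of `dictionary-items-month.csv`.

```python
import csv
import io

# Hypothetical sample data; the real dictionary-items-month.csv may differ.
# Assumed layout: a lemma, a part-of-speech code, then one or more surface forms.
SAMPLE_CSV = """lemma,poscode,surface
January,3,January,Jan
February,3,February,Feb
"""


def load_dictionary(text):
    """Return a mapping from each surface form to its lemma."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    surface_to_lemma = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        lemma, surfaces = row[0], row[2:]
        for surface in surfaces:
            surface_to_lemma[surface] = lemma
    return surface_to_lemma


months = load_dictionary(SAMPLE_CSV)
print(months["Feb"])
```

Under these assumptions, every surface form resolves to one canonical term, which is roughly the role a dictionary entry plays when it is later tied to a class.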
{: #wks_tutless_rule2}
In this lesson, you will learn how to add documents with linguistic patterns that illustrate the types of rules you want to define.
For more information about adding documents, see Adding documents for defining rules.
- Download the `documents-new.csv` file to your computer. This file contains example documents suitable for importing.
- From the sidebar, click Document Annotation > Rules.
- Click the Add a document icon next to Documents.
- Click the Import CSV file tab.
- Click to browse for the `documents-new.csv` file that you downloaded to your computer earlier, and then click Upload. A set of documents is displayed in the main Documents page.
{: #wks_tutless_rule3}
In this lesson, you will learn how to define classes that you will use when you define a rule.
For more information about classes, see Rules.
- From the Rules page of your workspace, click the Add a class icon next to Class in the right side panel.
- Enter `DictMonth` as the class name, and then click Add. The new class is displayed in the Class side panel.
{: #wks_tutless_rule4}
In this lesson, you will learn how to use a dictionary in the rule editor.
- From the sidebar, select Document Annotation > Dictionaries, and then click the Month dictionary that you created previously.
- From the Class list, select `DictMonth` and then click Save. The class is associated with the dictionary.

For documents that are associated with the rule editor, any references to terms in the dictionary are annotated as `DictMonth` class mentions. You will see proof that these references have been annotated in the next lesson.
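
As a rough illustration of what dictionary-based annotation means, the following Python sketch marks every month name in a piece of text as a `DictMonth` mention. The term list, the sample sentence, and the `annotate_months` helper are hypothetical; the product's own matching logic is not described in this tutorial and may behave differently.

```python
import re

# Assumed dictionary terms: the twelve English month names.
MONTH_TERMS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def annotate_months(text, terms=MONTH_TERMS, label="DictMonth"):
    """Return (start, end, matched_text, label) for each dictionary term found."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [(m.start(), m.end(), m.group(), label) for m in pattern.finditer(text)]


# Hypothetical sentence, not taken from the tutorial's sample documents.
sample = "The update shipped on February 3, 2009."
annotations = annotate_months(sample)
for start, end, matched, label in annotations:
    print(f"{label}: {matched!r} at {start}-{end}")
```

A plain term lookup like this cannot tell the month May from the verb "may" at the start of a sentence, which is one reason to combine dictionary classes with surrounding context in a rule.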
{: #wks_tutless_rule5}
In this lesson, you will learn how to find class annotations in rule editor documents.
- From the sidebar, select Document Annotation > Rules.
- From the Class panel, find the `DictMonth` class that you defined earlier, and click the Search annotations in documents icon that's next to it. The Find Annotations page is displayed and shows all the documents that contain text references to months.
- Click the `Technology - computerworld.com` document to view the full document. Notice that the text `February` is highlighted, which means it was annotated as a mention of the `DictMonth` class.
{: #wks_tutless_rule6}
In this lesson, you will learn how to define a regular expression.
You will define a regular expression that can find year patterns like 2009.
For more information about defining regular expressions, see Defining a rule.
- From the Rules page, click the Add a class icon next to Class in the right side panel.
- Enter `RegExpYear` as the class name, and click Add.
- From the sidebar, click Regex, and then click the Create a regular expression icon next to Regular Expressions.
- Click the Add Entry button.
- In the Regular Expression field, enter the following expression:
(?:(?:19|20)[0-9]{2})
{: screen}
Note: This regular expression finds years between 1900 and 2099.
- Set Minimum Word Tokens to `1` and Maximum Word Tokens to `1`.
- Click Add to save the regular expression entry.
- Enter `MyYearExp` as the regular expression name, and then, from the Class menu, select the RegExpYear class that you defined earlier.
- Click Save. After you save the regular expression, it is automatically applied to the sample documents. Any text strings that follow the pattern that you defined in the regular expression are annotated as mentions of the RegExpYear class.
- To check whether the expression you defined is capturing year occurrences correctly, you can search for mentions. Click the Search annotations in documents icon next to the RegExpYear class in the Class side panel. The Find Annotations page is displayed. Occurrences of year mentions are highlighted in the sample documents in which they occur.
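
To see what the year expression accepts, you can try it outside the tool. The Python sketch below applies the same pattern to single word tokens. Matching each token in full is an assumption that loosely mirrors setting both Minimum Word Tokens and Maximum Word Tokens to 1; the sample sentence and the `is_year_token` helper are hypothetical.

```python
import re

# The expression from this lesson: "19" or "20" followed by exactly two digits.
YEAR_PATTERN = re.compile(r"(?:(?:19|20)[0-9]{2})")


def is_year_token(token):
    """True when the whole token is a year from 1900 through 2099."""
    return YEAR_PATTERN.fullmatch(token) is not None


# Hypothetical sentence, split on whitespace into word tokens.
tokens = "Released on February 3 , 2009 after the 1899 draft".split()
year_tokens = [token for token in tokens if is_year_token(token)]
print(year_tokens)
```

In this sketch, `2009` is accepted, while `1899` and `3` are rejected because they do not start with 19 or 20 followed by two digits.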
{: #unique_1166829415}
In this lesson, you will learn how to define a rule.
You already defined a dictionary-based class for annotating month mentions. You also defined a regular expression that finds numeric values that represent a year. Now, you will define a rule that captures the sequence of a month followed by a number, a comma, and then a year, as in date expressions like September 21, 2016.
For more information about defining rules, see Defining a rule.
- From the sidebar, select Document Annotation > Rules, and open the `Technology - computerworld.com` document.
- Select the text February 3, 2009 in the document. Make sure you select the comma, too.
- Click the Add a rule icon. The rule editor shows a depiction of the rule pattern that you identified. The text February 3, 2009 is visible. A gray line that connects the cells in the depiction identifies which cells are currently part of the pattern.
  - The DictMonth class is part of the rule pattern instead of the text February. This selection is preferred because you want the model to find any month that is annotated by the DictMonth class as the first token in the date pattern instead of the text February only.
  - At the end of the rule, the year 2009 is already annotated as being a mention of the RegExpYear class. The RegExpYear class is part of the rule pattern instead of the number 2009. This selection is also preferred because you want the model to find any year that is annotated by the RegExpYear class as the last token in the date pattern instead of the specific text 2009 only.

  The number 3 and the comma (,) after it are shown as the second and third tokens in the pattern. As the pattern is currently specified, the model will find only occurrences of dates that specify the 3rd day of a month. We want the model to find dates that specify any day of the month, so next we will change the feature settings for the day token.
- Above the day `3` cell, click the Text icon to open the feature settings for the token. Currently, the rule is set to match the exact text, `3`. Instead, we want it to match any number.
- Change the feature setting to be numeric by selecting Character Type : Numeric, and then deselecting Text : 3. You changed the definition for the number `3` cell. The Aa icon indicates that instead of requiring the number to be equal to 3 exactly, it can be any number.
- Do not change any settings for the comma token. We want the third token in the pattern to be a comma, so the current feature setting of Text : , is appropriate. In addition to a feature setting, each token has a repeat setting. The repeat setting specifies how many times the token can be repeated in the text for it to match the pattern. The current repeat setting of Required (Exactly 1) is also appropriate as specified.
- Assign a class to represent the pattern `DictMonth + numeric token + comma + RegExpYear`. Notice the four empty cells that represent the four tokens that you selected from the document. To select all the cells, select the first cell, and then hold Shift and click each additional cell. Enter `RuleDate` as the class name, and then click it to create the new class. You have successfully defined the pattern for the rule.
- In the Rule name field, enter `MyDateRule` and click Save. After you save the rule, it is automatically applied to the sample documents. If the `Technology - computerworld.com` document is still open in the rule editor, you will see that the `February 3, 2009` text in the document is now annotated as a mention of the RuleDate class. You can search for all occurrences of RuleDate class mentions in the sample documents by clicking the Search annotations in documents icon next to the `RuleDate` class in the Class panel. It is a good practice to check whether all dates are being captured properly to confirm that you defined the pattern correctly.
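
The pattern `DictMonth + numeric token + comma + RegExpYear` can be approximated in a few lines of Python to check your understanding of what the rule should and should not capture. This is a sketch under stated assumptions, not the rule editor's implementation: the `MONTHS` set, the `tokenize` and `find_dates` helpers, and the sample sentence are all hypothetical.

```python
import re

# Stand-ins for the two classes defined earlier in this tutorial.
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}
YEAR = re.compile(r"(?:19|20)[0-9]{2}")


def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_dates(text):
    """Return every 'month day, year' sequence found in the text."""
    tokens = tokenize(text)
    matches = []
    # Slide a four-token window: month, numeric token, comma, year.
    for i in range(len(tokens) - 3):
        month, day, comma, year = tokens[i:i + 4]
        if (
            month in MONTHS            # DictMonth
            and day.isdigit()          # any numeric token, not only "3"
            and comma == ","           # exact text ","
            and YEAR.fullmatch(year)   # RegExpYear
        ):
            matches.append(f"{month} {day}, {year}")
    return matches


dates = find_dates("Posted February 3, 2009 and revised September 21, 2016.")
print(dates)
```

Like the rule in this lesson, the sketch accepts any number as the day, so a string such as February 45, 2009 would also match. Whether that is acceptable depends on how strict the date pattern needs to be.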
{: #wks_tutless_rule8}
In this lesson, you will learn how to create a rule annotator.
For more information about creating a rule annotator, see Creating the rule-based model.
- From the sidebar, select Model Management > Versions and click the Rule-based model type mapping tab.
- Map the `RuleDate` class that you defined to the `DATE` entity type from the type system.
- To run the rule-based model, select the Rule-based tab and click Run this model.
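
Conceptually, the type mapping is a lookup from rule classes to entity types. The Python sketch below illustrates that idea with a plain dictionary. It assumes that annotations from unmapped helper classes such as DictMonth and RegExpYear are not reported as entities; the `map_annotations` helper and the sample annotations are hypothetical and do not reflect the service's actual data format.

```python
# Mapping from rule class to entity type, as configured in this lesson.
CLASS_TO_ENTITY_TYPE = {"RuleDate": "DATE"}


def map_annotations(annotations, mapping=CLASS_TO_ENTITY_TYPE):
    """Relabel mapped class mentions with their entity type; drop the rest."""
    return [
        (text, mapping[class_name])
        for text, class_name in annotations
        if class_name in mapping
    ]


# Hypothetical rule output for the sample date used in this tutorial.
rule_output = [
    ("February", "DictMonth"),
    ("2009", "RegExpYear"),
    ("February 3, 2009", "RuleDate"),
]
entities = map_annotations(rule_output)
print(entities)
```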
{: #wks_tutrule_sum}
While learning about {{site.data.keyword.knowledgestudioshort}}, you created a rule-based annotator.
By completing this tutorial, you learned about the following concepts:
- Classes
- Regular expressions
- Rules