
Simple Template

Consider a scenario where we want to extract all the email addresses that appear in a text file. In this case, we can use a simple template: a single template element whose properties define the text extraction criteria.

For the above scenario, we can simply use the following template definition:


<template extractFormat="email" />


The above template definition sets the extractFormat property on the “template” element. The extractFormat property specifies the format of the text to be extracted for a template element, so that extraction is driven by the data format.
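To make the idea concrete, the following is a minimal Python sketch of what such a template asks for. It is not the extractor's implementation: the function name `extract_emails` and the `EMAIL` pattern are assumptions made for illustration, and the pattern only approximates whatever the extractor uses internally.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified email pattern (an assumption, not the extractor's own).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(template_xml, text):
    """Read the template element and return every email-like match in text."""
    element = ET.fromstring(template_xml)
    # This sketch understands only the single case discussed above.
    if element.tag != "template" or element.get("extractFormat") != "email":
        raise ValueError('expected <template extractFormat="email" />')
    return EMAIL.findall(text)


sample = "Refer to armor.cathe@gamil.com or visit http://orderAtEase.com."
print(extract_emails('<template extractFormat="email" />', sample))
# prints ['armor.cathe@gamil.com']
```

The template itself carries no matching logic; it only names the format, and the extractor decides how that format is recognized.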


Similarly, we can use the following extractFormat property values to extract different types of textual data:

  • int - This format can be used to extract integers from a given text input.

    Example

    Input

    There are 8 items in order number 11002345.

    Template Definition

    <template name="simpleTemplate" extractFormat="int"/>

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleTemplate": [ 8, 11002345 ] } }

  • bool - This format can be used to extract boolean values, that is, "true" or "false", from a given text input.

    Example

    Input

    There are 8 items in order number 11002345 and they will be true to their description. Please let us know in case you find any false information in the products description.

    Template Definition

    <template name="simpleTemplate" extractFormat="bool" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleTemplate": [ true, false ] } }

  • float - This format can be used to extract floating point numbers from a given text input.

    Example

    Input

    There are 8 items in order number 11002345. The price of each item in the order is greater than Rs. 100.00.

    Template Definition

    <template name="simpleTemplate" extractFormat="float" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleTemplate": [ 8.0, 11002345.0, 100.0 ] } }

  • email - This format can be used to extract emails from a given text input.

    Example

    Input

    The order placed by "Armor Cathe" is successful. There are 8 items in order number 11002345. The price of each item in the order is greater than Rs. 100.00. The order has to be delivered at pin code 0012345. For order details visit http://orderAtEase.com or refer to the registered email address armor.cathe@gamil.com.

    Template Definition

    <template name="simpleEmailTemplate" extractFormat="email" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleEmailTemplate": [ "armor.cathe@gamil.com" ] } }

  • url - This format can be used to extract all URLs (address of a World Wide Web page) from a given text input.

    Example

    Input

    There are 8 items in order number 11002345. The price of each item in the order is greater than Rs. 100.00. For order details visit http://orderAtEase.com.

    Template Definition

    <template name="simpleTemplate" extractFormat="url" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleTemplate": [ "100.00.", "http://orderAtEase.com." ] } }

  • quotedString - This format can be used to extract a sequence of characters that appear between double quotes.

    Example

    Input

    The order placed by "Armor Cathe" is successful. There are 8 items in order number 11002345. The price of each item in the order is greater than Rs. 100.00. For order details visit http://orderAtEase.com.

    Template Definition

    <template name="simpleTemplate" extractFormat="quotedString" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleTemplate": [ "Armor Cathe" ] } }

  • word - This format is used to extract a single literal word (having only a-z and A-Z characters).

    Example

    Input

    The 8 orders placed by "Armor Cathe" are successful.

    Template Definition

    <template name="simpleTemplate" extractFormat="word" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleTemplate": [ "The", "orders", "placed", "by", "Armor", "Cathe", "are", "successful" ] } }

  • whiteSpaces - This format is used to extract one or more occurrences of a white space (tab, new line, or space).

    Example

    Template Definition

    <template extractFormat="whiteSpaces" />

  • regex - This format is used when none of the formats defined above fits. To use this format, specify a regular expression that matches the data to be extracted. For example, the following template definition can be used to extract a sequence of 7 digits from the input text.

    Example

    Input

    The order placed by "Armor Cathe" is successful. There are 8 items in order number 11002345. The price of each item in the order is greater than Rs. 100.00. The order has to be delivered at pin code 0012345. For order details visit http://orderAtEase.com.

    Template Definition

    <template name="simpleEmailTemplate" extractFormat="regex:[0-9]{7} | " />
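The format values above can be pictured as a lookup from a format name to a pattern plus a value conversion. The Python sketch below illustrates that idea under stated assumptions: the `FORMATS` table, the `extract` function, and every pattern in it are hypothetical stand-ins chosen to reproduce the simple examples in this section, not the extractor's actual definitions. Formats whose exact matching rules are not evident from the examples (email, url, whiteSpaces) are left out.

```python
import re

# Hypothetical format table: name -> (approximate pattern, converter).
FORMATS = {
    "int": (r"\d+", int),
    "float": (r"\d+(?:\.\d+)?", float),
    "bool": (r"\b(?:true|false)\b", lambda s: s == "true"),
    "quotedString": (r'"[^"]*"', lambda s: s[1:-1]),  # strip the quotes
    "word": (r"[A-Za-z]+", str),
}


def extract(extract_format, text):
    """Return all matches for a format name or a 'regex:<pattern>' value."""
    if extract_format.startswith("regex:"):
        # Everything after the prefix is treated as a user-supplied pattern.
        pattern, convert = extract_format[len("regex:"):], str
    else:
        pattern, convert = FORMATS[extract_format]
    return [convert(m.group(0)) for m in re.finditer(pattern, text)]


print(extract("int", "There are 8 items in order number 11002345."))
# prints [8, 11002345]
print(extract("word", 'The 8 orders placed by "Armor Cathe" are successful.'))
# prints ['The', 'orders', 'placed', 'by', 'Armor', 'Cathe', 'are', 'successful']
print(extract("regex:[0-9]{7}", "pin code 0012345"))
# prints ['0012345']
```

Note that a bare pattern such as `[0-9]{7}` also matches the first seven digits of a longer number in this sketch, so a real pattern may need word boundaries or surrounding context to be selective.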

Additional properties for text extraction

You can set the following properties on the template XML element to enhance text extraction:

  • name - This mandatory property specifies a name for the template element. Naming each template element makes it easy to recognize the extracted element in the result. A name is also required when we want to inject the template element into another one.

    Example

    Input

    This is my working email: alexander.silva@grapecity.com and my private email is: alexsilva050@gmail.com. Please feel free to contact me also with silva050alexander@gmail.com.

    Template Definition

    <template name="simpleEmailTemplate" extractFormat="email" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleEmailTemplate": [ "alexander.silva@grapecity.com", "alexsilva050@gmail.com", "silva050alexander@gmail.com" ] } }

  • startingRegex - This property represents a regular expression that must match at the beginning of a possible instance of the template element.

    Example

    Input

    This is my working email: alexander.silva@grapecity.com and my private email is: alexsilva050@gmail.com. Please feel free to contact me also with silva050alexander@gmail.com.

    Template Definition

    To extract the working email address:

    <template name="simpleEmailTemplate" startingRegex="working email:" extractFormat="email" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "simpleEmailTemplate": [ "alexander.silva@grapecity.com" ] } }

  • endingRegex - This property represents a regular expression that must match at the end of a possible instance of the template element.

    Example

    Input

    This is my working email: alexander.silva@grapecity.com and my private email is: alexsilva050@gmail.com. Please feel free to contact me also with silva050alexander@gmail.com.

    Template Definition

    To extract the email addresses that appear at the end of a sentence:

    <template name="emails" endingRegex="[.]" extractFormat="email" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": { "emails": [ "alexsilva050@gmail.com", "silva050alexander@gmail.com" ] } }

  • ignoreWhitespaces - This property specifies whether the template-based extractor ignores white spaces when parsing an input source. The default value is “true”, so white spaces are ignored. To make the extractor take white spaces into account while parsing, set this property to "false".

    Example

    Input

    This is my working email: alexander.silva@grapecity.com and my private email is: alexsilva050@gmail.com. Please feel free to contact me also with silva050alexander@gmail.com.

    Template Definition

    To extract the working email address without ignoring white spaces:

    <template name="simpleEmailTemplate" startingRegex="working email:" extractFormat="email" ignoreWhiteSpaces="false" />

    Output (JSON)

    { "Extractor": "XMLTemplateBased", "Result": {} }
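One way to picture how these properties narrow the result is as pieces of a single composed pattern: the starting expression, then the value, then the ending expression, with optional white space allowed between the pieces only when white spaces are ignored. The Python sketch below follows that reading. It is an assumption about the behavior inferred from the examples above, not the extractor's implementation; the function `extract_emails`, its parameters, and the `EMAIL` pattern are all hypothetical.

```python
import re

# Hypothetical, simplified email pattern (an assumption for illustration).
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"


def extract_emails(text, starting_regex="", ending_regex="", ignore_whitespaces=True):
    """Compose start + value + end into one pattern and return the values."""
    # When white space is ignored, any run of it may sit between the pieces.
    gap = r"\s*" if ignore_whitespaces else ""
    pattern = ""
    if starting_regex:
        pattern += f"(?:{starting_regex}){gap}"
    pattern += f"({EMAIL})"
    if ending_regex:
        pattern += f"{gap}(?:{ending_regex})"
    return [m.group(1) for m in re.finditer(pattern, text)]


text = (
    "This is my working email: alexander.silva@grapecity.com and my private "
    "email is: alexsilva050@gmail.com. Please feel free to contact me also "
    "with silva050alexander@gmail.com."
)

print(extract_emails(text, starting_regex="working email:"))
# prints ['alexander.silva@grapecity.com']
print(extract_emails(text, ending_regex="[.]"))
# prints ['alexsilva050@gmail.com', 'silva050alexander@gmail.com']
print(extract_emails(text, starting_regex="working email:", ignore_whitespaces=False))
# prints []
```

Under this reading, the last call returns nothing because the space between "working email:" and the address is no longer skipped, which is consistent with the empty result shown in the ignoreWhitespaces example.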