The TextParser library provides the TemplateBasedExtractor class, which sets up a template-based extractor for parsing plain-text documents that follow any user-defined structure.
The structure is described by a template written declaratively in XML. The plain-text input can contain many instances of the defined template, and all text that matches the template specification can be extracted from it.
This section helps you get started with defining custom templates for the template-based extractor.
The template used for text extraction is defined formally using XML elements (tags) and their properties. The root of any XML template definition must be a template XML element. Extraction can be configured either by setting properties on the “template” element or by nesting template elements to define complex user-defined structures. The following template structures are available for the text extraction process:
To extract text using the TemplateBasedExtractor class, implement the steps shown in the code snippet below:
After the XML template is defined and applied through code, the parsed result is returned as a JSON string, which can then be used as the text extracted from the input source.
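The overall flow described here (an XML template with a template root element, an input containing several instances of it, and a JSON string as the result) can be illustrated with a small, self-contained Python sketch. It does not use the TextParser API: the `field` element, its `name`, `label`, and `pattern` attributes, and the `compile_template` and `extract` functions are all hypothetical names invented for this illustration, and the real template schema may look quite different.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical template format invented for this sketch; it is NOT the
# TextParser XML schema. Each <field> gives a literal label and a regular
# expression for the value that follows it on the same line.
TEMPLATE_XML = r"""
<template name="contact">
  <field name="name" label="Name:" pattern=".+"/>
  <field name="email" label="Email:" pattern="\S+"/>
</template>
"""

SAMPLE_TEXT = """Contacts exported on Monday.

Name: Ada Lovelace
Email: ada@example.com

Some unrelated note.

Name: Alan Turing
Email: alan@example.com
"""


def compile_template(xml_text):
    """Turn the hypothetical XML template into one compiled regex."""
    root = ET.fromstring(xml_text)
    # Mirror the rule that the root of a template definition is <template>.
    if root.tag != "template":
        raise ValueError("root element must be <template>")
    parts = []
    for field in root.findall("field"):
        name = field.get("name")
        label = re.escape(field.get("label"))
        value = field.get("pattern")
        # One named group per field, so matches map directly to JSON keys.
        parts.append(f"{label}\\s*(?P<{name}>{value})")
    # Fields are expected on consecutive lines, in template order.
    return re.compile(r"[ \t]*\n".join(parts))


def extract(xml_text, text):
    """Return every instance of the template found in text, as a JSON string."""
    pattern = compile_template(xml_text)
    records = [match.groupdict() for match in pattern.finditer(text)]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(extract(TEMPLATE_XML, SAMPLE_TEXT))
```

In this sketch, text that does not match the template (the header line and the unrelated note) is simply skipped, and each matching instance becomes one object in the JSON array.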