Extract Product Information from Amazon in .NET
Over the years, e-commerce has seen tremendous growth as an industry. A user can purchase almost anything across the many available e-commerce platforms. While customer satisfaction rises with all these options, another important element is growing exponentially: data.
These e-commerce sites contain vast amounts of data: the products available, prices, availability, ratings, customer reviews, and more. Collecting and analyzing this data can play a vital role in your success story, whether you are an e-commerce competitor, a product comparison portal, or a market research firm. How can this data be collected?
The simplest approach is manual collection: visit a page, note down the information needed (probably in an Excel file), and repeat the same for other pages. This is obviously cumbersome and not a feasible solution at scale.
This blog looks at creating an easier solution for extracting data from e-commerce sites using the C1TextParser library, with amazon.com as an example. C1TextParser is a .NET Standard library that can extract data in a structured format from semi-structured sources such as emails, web pages, and text files.
Install C1TextParser
The simplest way to install C1TextParser is via NuGet. Search for the “C1.TextParser” package and install it.
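If you prefer the command line, the same package can be added with the .NET CLI:

dotnet add package C1.TextParser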
C1TextParser can also be downloaded and installed through the service components tile in the ComponentOne Control Panel. Installing it this way also gives you access to samples and other components.
Define the HTML Extractor
C1TextParser provides three extractors: Starts-After-Continues-Until, HTML, and Template-based. For this blog, we will use the HtmlExtractor. You can read about how the HtmlExtractor works in this blog.
To create and use the HtmlExtractor, we need three things:
- A template stream: A stream of the HTML content to be used as a template. The placeholders are declared relative to this.
- Placeholders: XPaths to the HTML elements from which the data needs to be extracted. These can be fixed or repeated.
- An input stream: The text stream of the page from which the data needs to be extracted.
For example, the following shows part of a web page used as a template to extract price information for products on amazon.com.
And the following code shows how to set up HtmlExtractor to extract this info from the webpage:
// Client is an HttpClient instance; productUri points to the product page.
var response = await Client.GetAsync(productUri);
var page = await response.Content.ReadAsStreamAsync();
// The page serves as both the template and the input stream here.
var extractor = new HtmlExtractor(page);
extractor.AddPlaceHolder("price", @"//*[@id=""priceblock_ourprice""]");
// Rewind before extracting, in case reading the template consumed the stream.
page.Seek(0, SeekOrigin.Begin);
var extractionResult = extractor.Extract(page);
var json = extractionResult.ToJsonString();
To extract other data, such as the product name, ratings, etc., from the web page, we can keep adding placeholders using the AddPlaceHolder method and then perform the extraction to get the desired results. This, however, is not a foolproof solution, as we will see in the next section.
The XPathPool
The approach shown above works fine as long as all our web pages follow the same structure as the defined template, i.e., the web elements on every page sit at the same XPaths as those defined by the template placeholders.
However, with modern e-commerce sites, this cannot be guaranteed. The positioning of elements on these sites is dynamic in nature; there can always be an additional HTML row before our placeholder element, making the XPath in the placeholder invalid.
This problem can be solved if the XPaths for the placeholders are chosen dynamically. To assist in this dynamic selection, we introduce the notion of an XPathPool. An XPathPool, as the name suggests, is a pool of XPaths collected from multiple web pages.
The following shows the IXPathPool interface:
public interface IXPathPool
{
    // Returns, for each placeholder name, the first XPath found to be valid for the given HTML.
    IEnumerable<XPathObject> GetBestMatchingXPaths(string html);
    // Register one or more candidate XPaths with the pool.
    void AddXPathToPool(XPathObject xPathObj);
    void AddXPathsToPool(IEnumerable<XPathObject> xPathObjs);
    // Empty the pool.
    void Clear();
}
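The concrete XPathPool implementation is not listed in full in this blog. Here is a minimal sketch, assuming the pool groups the registered XPathObjects in a dictionary keyed by placeholder name (the _xPaths field used by GetBestMatchingXPaths later in this article):

public class XPathPool : IXPathPool
{
    // Candidate XPaths grouped by placeholder name.
    private readonly Dictionary<string, List<XPathObject>> _xPaths =
        new Dictionary<string, List<XPathObject>>();

    public void AddXPathToPool(XPathObject xPathObj)
    {
        if (!_xPaths.TryGetValue(xPathObj.Name, out var candidates))
            _xPaths[xPathObj.Name] = candidates = new List<XPathObject>();
        candidates.Add(xPathObj);
    }

    public void AddXPathsToPool(IEnumerable<XPathObject> xPathObjs)
    {
        foreach (var xPathObj in xPathObjs)
            AddXPathToPool(xPathObj);
    }

    public void Clear() => _xPaths.Clear();

    // Shown in "The XPathObject and XPathMatchValidator" section below.
    public IEnumerable<XPathObject> GetBestMatchingXPaths(string html)
        => throw new NotImplementedException();
}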
The following shows how candidate XPaths for a product's color are added to the pool:
// Two alternative XPaths for the same placeholder, taken from different page layouts.
var _xpathPool = new XPathPool();
_xpathPool.AddXPathToPool(new XPathObject("ProductColor", @"//*[@id=""poExpander""]/div[1]/div/table/tbody/tr[1]/td[2]/span"));
_xpathPool.AddXPathToPool(new XPathObject("ProductColor", @"//*[@id=""productOverview_feature_div""]/div/table/tbody/tr[4]/td[2]/span"));
Next, we'll look at how, with some validation, the XPathPool dynamically selects the best valid XPaths to define the placeholders.
The XPathObject and XPathMatchValidator
Observe that the data to extract always has an identifier text that tells what the data is about. For example, the text ‘Color’ on the pages below identifies that the adjacent text gives the product's color.
So, for every XPath that goes into the XPathPool, the following information is saved:
- XPath of the web element that contains the data
- Name used for this XPath in the template placeholder
- Identifier:
  - XPath of the web element that contains the identifier
  - Possible text values that the identifier XPath can contain
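The base class BaseXPathObject is not listed in the blog; the following is a minimal sketch, assuming it simply carries the placeholder name and the XPath that the two classes shown next share:

public abstract class BaseXPathObject
{
    // Placeholder name under which the XPath is registered.
    public string Name { get; protected set; }
    // XPath of the web element of interest.
    public string XPath { get; protected set; }
}

Note that the HtmlDocument and HtmlNode types below come from the HtmlAgilityPack library.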
public class XPathMatchValidator : BaseXPathObject
{
    // The identifier texts that the identifier element is expected to contain.
    public IEnumerable<string> ItemsSource { get; private set; }

    public XPathMatchValidator(string xPath, string validatorText)
    {
        XPath = xPath;
        ItemsSource = new List<string> { validatorText };
    }

    // Returns true if the identifier element exists and its trimmed
    // text matches one of the expected identifier values.
    public bool Validate(HtmlDocument doc)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(XPath);
        if (node == null)
            return false;
        var innerText = node.InnerText.Trim(' ', ':');
        return ItemsSource.Contains(innerText);
    }
}

public class XPathObject : BaseXPathObject
{
    public XPathObject(string name, string xPath)
    {
        Name = name;
        XPath = xPath;
    }

    // Settable so the validator can be assigned via an object initializer.
    public XPathMatchValidator Validator { get; set; }
    public bool HasValidator => Validator != null;
}
With the XPathMatchValidator, the XPathPool initialization changes to:
var _xpathPool = new XPathPool();
var xPathObj1 = new XPathObject("ProductColor", @"//*[@id=""poExpander""]/div[1]/div/table/tbody/tr[1]/td[2]/span")
{
Validator = new XPathMatchValidator(@"//*[@id=""poExpander""]/div[1]/div/table/tbody/tr[1]/td[1]/span", "Color")
};
var xPathObj2 = new XPathObject("ProductColor", @"//*[@id=""productOverview_feature_div""]/div/table/tbody/tr[4]/td[2]/span")
{
Validator = new XPathMatchValidator(@"//*[@id=""productOverview_feature_div""]/div/table/tbody/tr[4]/td[1]/span", "Color")
};
_xpathPool.AddXPathToPool(xPathObj1);
_xpathPool.AddXPathToPool(xPathObj2);
The following code shows how the XPathPool uses the validators to select XPaths dynamically:
public IEnumerable<XPathObject> GetBestMatchingXPaths(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // _xPaths maps each placeholder name to its candidate XPathObjects.
    foreach (var pair in _xPaths)
    {
        foreach (var xPathObj in pair.Value)
        {
            // Without a validator, the first XPath that resolves in the document wins.
            if (!xPathObj.HasValidator && doc.IsValidXPath(xPathObj.XPath))
            {
                yield return xPathObj;
                break;
            }
            // With a validator, the XPath wins only if its identifier text matches.
            if (xPathObj.HasValidator && xPathObj.Validator.Validate(doc))
            {
                yield return xPathObj;
                break;
            }
        }
    }
    doc.DocumentNode.ChildNodes.Clear();
}
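The IsValidXPath helper used above is not defined in these snippets. A minimal sketch, assuming it is an extension method that treats an XPath as valid when it selects at least one node in the document:

public static class HtmlDocumentExtensions
{
    public static bool IsValidXPath(this HtmlDocument doc, string xPath)
    {
        // The XPath is valid for this page if it resolves to a node.
        return doc.DocumentNode.SelectSingleNode(xPath) != null;
    }
}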
Bringing It All Together
Now that all the hard work is done, defining the HtmlExtractor is as simple as the following:
var xPathObjects = _xPathPool.GetBestMatchingXPaths(html);
// UTF-8 preserves any non-ASCII characters in the page.
var stream = new MemoryStream(Encoding.UTF8.GetBytes(html));
var extractor = new HtmlExtractor(stream);
foreach (var xPath in xPathObjects)
{
    extractor.AddPlaceHolder(xPath.Name, xPath.XPath);
}
// actualPage is the input stream of the page to extract data from;
// here, that is the same page the XPaths were matched against.
var result = extractor.Extract(actualPage);
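For context, a possible end-to-end flow might look like the following, where ExtractProductInfo is a hypothetical helper wrapping the extraction code above:

// Fetch the product page and run the extraction sketched above.
using var client = new HttpClient();
var html = await client.GetStringAsync(productUri);
var result = ExtractProductInfo(html);
Console.WriteLine(result.ToJsonString());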
The complete code for the above sample can be found at the end of this article. You can learn more about TextParser from the product documentation.
Download the HtmlParsingDemo sample here.