Comparing DataEngine to LINQ - Data Transformation
The ComponentOne DataEngine library is equipped with out-of-the-box business intelligence capabilities, like column-oriented and memory-mapped file techniques, that accelerate data analysis and exploration. DataEngine (C1DataEngine) helps you transform data from multiple sources as it pertains to your business requirements.
It sounds impressive, but what did we do before DataEngine? Most likely a combination of LINQ, IEnumerable methods or SQL. ComponentOne DataEngine isn’t necessarily intended to make it easier to transform data (although it does in some situations). LINQ is a set of query capabilities integrated directly into the C# language and it has come a long way to making data transformation very easy. The true value of DataEngine lies in its all-in-one workspace methodology, and it’s performance improvements on top of LINQ.
In this post, we’ll look at how we perform some common data transformations using ComponentOne DataEngine compared to a common solution everyone’s familiar with: LINQ queries. We’ll look at selecting, grouping, and filtering data to solve some common data transformations.
Connecting and Selecting JSON
To start we’ll import JSON data from the public Northwind OData service into a queryable collection of objects. This step would be the same regardless of whether you plan to use LINQ, IEnumerable methods or C1DataEngine.
This code uses the Newtonsoft.Json library to parse and populate JSON data to a list of our custom Invoice objects. Similar samples can be downloaded with our Service Components, or browsed online.
// get NorthWind Invoice JSON data
WebClient wc = new WebClient();
string download = wc.DownloadString("https://services.odata.org/v4/Northwind/Northwind.svc/Invoices");
JObject root = JObject.Parse(download);
JArray values = (JArray)root["value"];
List<Invoice> collection = JsonConvert.DeserializeObject<List<Invoice>>(values.ToString());
Creating a DataEngine Workspace
When you work with LINQ or any standard data transformation techniques, you are responsible for managing each transformation and subsequent collections they produce. This means that you have to efficiently manage how the data is stored in memory.
With DataEngine, it uses a ‘workspace’ paradigm to hold and manage all of the various queries an app may require for you. This requires a few more lines of code in the beginning, but it pays off later. The workspace reserves local file space (relative to the app) for storing metadata, to help with column mapping and caching, which improves performance the more data you work with.
Here’s the code you need to set up your workspace:
string path = "workspace"; // absolute or relative path to the workspace folder
Workspace workspace = new Workspace();
workspace.Init(path);
workspace.Clear();
To connect the DataEngine to the retrieved collection of custom objects, create an instance of the ObjectConnector class and pass the Workspace object and the custom collection initialized earlier, as parameters to its constructor. The GetData method creates a DataEngine table containing the imported data.
var connector = new ObjectConnector<Invoice>(workspace, collection);
connector.GetData("Invoices");
workspace.Save();
Clearing and saving the workspace commits changes to the local stored files.
Grouping Queries with LINQ vs DataEngine
Once connected, the next task is to group the data and sum the amounts.
Using LINQ we would use the group operator and the sum extension on the fields we wish to group by and aggregate. This operation results in a new list, so we must also specify every column that we want to carry over from the full data set to the new grouped data set.
This code groups by Salesperson and sums the ExtendedPrice field using LINQ. It’s then converted to a list just so we can populate a datagrid.
var results = (from inv in collection
group inv by inv.Salesperson into g
select new Invoice
{
Salesperson = g.Key,
ExtendedPrice = g.Sum(pc => pc.ExtendedPrice)
}).ToList();
Using DataEngine we will use a similar technique; however, we will not need to define a new list of objects. Instead, we define a new view for our workspace and name it “SalesByEmployee”. This view is defined by two fields: Salesperson and Sales, which is a sum of the ExtendedPrice value.
dynamic invoices = workspace.table("Invoices");
dynamic query = workspace.query("SalesByEmployee", new
{
invoices.Salesperson,
Sales = Op.Sum(invoices.ExtendedPrice)
});
query.Query.Execute();
The usage of the Sum method in both the LINQ and C1DataEngine examples is similar, but the key difference is that with DataEngine we are defining the workspace query. That makes it very easy to define additional views. For example, let’s modify the grouping to group by each unique Salesperson and Country pairing. This will tell us the total that each person sold for each country.
Using DataEngine, we simply add a new query to the workspace, this time named “SalesByEmployeeCountry” and we simply add the Country field. The DataEngine handles the grouping logic, which is now more complex.
dynamic query = workspace.query("SalesByEmployeeCountry", new
{
invoices.Salesperson,
invoices.Country,
Sales = Op.Sum(invoices.ExtendedPrice)
});
query.Query.Execute();
To populate a datagrid with a query from DataEngine, you call the GetQueryData from the workspace, easily swapping any query you need here.
dataGridView1.DataSource = workspace.GetQueryData("SalesByEmployeeCountry");
How would we do the same using LINQ? It’s possible by defining a more complex group by an object in your LINQ query. This is where it starts to get more complicated – the syntax alone requires a reference (Google search). So in a way, DataEngine does make the complex transformation easier to code.
var results = (from inv in invoices
group inv by new { inv.Salesperson, inv.Country } into g
select new Invoice
{
Salesperson = g.Key.Salesperson,
Country = g.Key.Country,
ExtendedPrice = g.Sum(pc => pc.ExtendedPrice)
}).ToList();
dataGridView1.DataSource = results;
Filtering Queries with LINQ vs DataEngine
The next data transformation we’d like to perform on our data set is a filter, to display just the records for one specific country, the UK.
Using LINQ this is performed by adding the where operator before the group operator as seen below.
var results = (from inv in invoices
where inv.Country.Equals("UK")
group inv by new { inv.Salesperson, inv.Country } into g
select new Invoice
{
Salesperson = g.Key.Salesperson,
Country = g.Key.Country,
ExtendedPrice = g.Sum(pc => pc.ExtendedPrice)
}).ToList();
To set a filter using DataEngine, again we define a query with a new name and set the _filter member as seen below. The query will be named appropriately as “SalesByEmployeeUK”, and it also groups by Salesperson, Country and aggregates ExtendedPrice as the previous examples.
dynamic query = workspace.query("SalesByEmployeeUK", new
{
_filter = invoices.Country.Eq("UK"),
invoices.Salesperson,
invoices.Country,
Sales = Op.Sum(invoices.ExtendedPrice)
});
Note that DataEngine filtering can be implemented using the _Or_ method as well as And method of the ColumnCondition class. This allows you to build more complex filters.
For example:
_filter = invoices.Country.Eq("UK").Or().Eq("Spain")
LINQ vs DataEngine Conclusion
The examples above show how similar the code is between LINQ and DataEngine for common data transformations. The main difference is that DataEngine uses a workspace paradigm that allows you to build a collection of data views.
The DataEngine internally does the heavy lifting for you when it comes to complex transformations like grouping by multiple columns or concatenated filter expressions. And thanks to its in-memory, caching it can improve performance the larger your data becomes.