How to Read and Extract Table Data from PDF Using C# .NET
Quick Start Guide | |
---|---|
What You Will Need | Visual Studio, a .NET Core Console App, and the NuGet packages installed in the steps below |
Controls Referenced | Document Solutions for PDF - A C# .NET PDF document API library that lets developers programmatically create and manipulate PDF documents at scale. Document Solutions for Excel, .NET Edition - A high-speed C# .NET Excel spreadsheet API library that lets you programmatically create, edit, import, and export Excel files with no dependency on MS Excel. |
Tutorial Concept | Learn how to use a .NET PDF API to access tables in PDF files and extract tabular data for export to CSV files or other formats, such as XLSX, as needed. |
Manual PDF table extraction? No, thanks!
The Document Solutions for PDF .NET library (DsPdf) automates this tedious task, so developers don't have to retype or copy table data by hand. The extracted data can then be saved as CSV, XLSX, or other formats for analysis.
This article shows how to use DsPdf from C# to extract table data from PDF documents (the same library also supports plain text extraction). If this is new to you and you would like to follow along, you can download Document Solutions for PDF or its associated NuGet packages. Before we jump in, let's review some background.
Important Information About Tables in PDF Documents
While PDFs are a popular way to present data in tables, it's important to remember that these tables aren't a built-in feature of the PDF file format. Instead, they're created by arranging text and shapes to look like tables.
Unlike tables in programs like Excel or Word, these PDF "tables" typically lack a deeper structure: nothing in the file defines rows, columns, or cells. So, how can we process this data? Let's see how the DsPdf API can help us work with this information!
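To make the idea of "text arranged to look like a table" concrete, here is a minimal sketch of how positioned text fragments could be grouped back into rows. It is only an illustration and not how DsPdf works internally; the `TextFragment` record, the `GroupIntoRows` helper, and the 2-point tolerance are all hypothetical names and values chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a piece of text drawn at a position on a PDF page.
public record TextFragment(float X, float Y, string Text);

public static class RowGrouping
{
    // Groups fragments into rows: a fragment joins the current row when its Y
    // is within `tolerance` of the row's first fragment. Each row is then
    // ordered left to right by X.
    public static List<List<string>> GroupIntoRows(
        IEnumerable<TextFragment> fragments, float tolerance = 2f)
    {
        var rows = new List<(float Y, List<TextFragment> Items)>();
        foreach (var f in fragments.OrderBy(f => f.Y))
        {
            if (rows.Count > 0 && Math.Abs(f.Y - rows[^1].Y) <= tolerance)
                rows[^1].Items.Add(f);
            else
                rows.Add((f.Y, new List<TextFragment> { f }));
        }
        return rows
            .Select(r => r.Items.OrderBy(i => i.X).Select(i => i.Text).ToList())
            .ToList();
    }
}
```

Real documents add complications such as wrapped cell text and uneven spacing, which is why a dedicated table recognition API with tunable options is useful.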
How to Extract Table Data from PDF Documents Programmatically Using C#
- Create a .NET Core Console Application with DsPdf Included
- Load the Sample PDF that Contains a Data Table
- Define Table Recognition Parameters
- Get the PDF's Table Data
- Save Extracted PDF Table Data to Another File Type (CSV)
- BONUS: Extract Data from a Multi-Page Table Using C#
- BONUS: Format the Exported PDF Table Data in an Excel (XLSX) File
You can download the finished sample application if you’d like to see the completed application rather than following along. If you don’t want to download the sample, you can simply view a similar demo featuring full code on our website.
Create a .NET Core Console Application with DsPdf Included
First, create a .NET Core Console application, right-click 'Dependencies', and select 'Manage NuGet Packages'. Under the 'Browse' tab, search for 'DS.Documents.Pdf' and click 'Install'.
Be sure to accept the "License Acceptance" dialog when installing so the required .dll files will load correctly.
In the Program.cs file, add the following using statements:
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.Recognition;
The DsPdf classes we will be using live in the GrapeCity.Documents.* namespaces. The System namespaces supply the file, collection, and encoding types, as well as geometry types such as RectangleF.
Load the Sample PDF that Contains a Data Table
Next, we'll create a GcPdfDocument instance and call its Load method, passing a file stream, to load the PDF document that contains the data table to be parsed. Keep the remaining steps inside the using block so the stream stays open while the document is in use:
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    // Initialize GcPdf document object
    var pdfDoc = new GcPdfDocument();
    // Load a PDF document
    pdfDoc.Load(fs);
}
In this example, we're using an invoice PDF (zugferd-invoice.pdf) that contains the table to extract.
Define Table Recognition Parameters
Next, we'll create an instance of the standard RectangleF structure, which defines the table bounds in the PDF document as x, y, width, and height. PDF coordinates are measured in points, with 72 points per inch by default, so we define a DPI constant of 72 and express the bounds in inches:
const float DPI = 72;
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    // Initialize GcPdf document object
    var pdfDoc = new GcPdfDocument();
    // Load a PDF document
    pdfDoc.Load(fs);
    // The approximate table bounds (x, y, width, height), in points:
    var tableBounds = new RectangleF(0, 2.5f * DPI, 8.5f * DPI, 3.75f * DPI);
}
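If you measure the table position in inches (for example, with a ruler tool in a PDF viewer), a small helper can keep the unit conversion in one place. This is a sketch only; `PdfUnits` and `FromInches` are hypothetical names and not part of DsPdf.

```csharp
using System.Drawing;

public static class PdfUnits
{
    // Default PDF page resolution: 72 points per inch.
    public const float PointsPerInch = 72f;

    // Builds a rectangle in points from a position and size given in inches.
    public static RectangleF FromInches(float x, float y, float width, float height) =>
        new RectangleF(
            x * PointsPerInch,
            y * PointsPerInch,
            width * PointsPerInch,
            height * PointsPerInch);
}
```

With this helper, the bounds above could be written as `PdfUnits.FromInches(0f, 2.5f, 8.5f, 3.75f)`, which yields a rectangle at (0, 180) that is 612 points wide and 270 points high.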
To help with table recognition inside those bounds, we'll also use the TableExtractOptions class, which allows us to fine-tune how the table formatting is interpreted, such as the column width, row height, and distance between the rows or columns:
// TableExtractOptions allows us to fine-tune table recognition accounting for
// the specifics of the table formatting:
var tableExtrctOpt = new TableExtractOptions();
// Keep a reference to the default implementation so the new one can call it:
var GetMinimumDistanceBetweenRows = tableExtrctOpt.GetMinimumDistanceBetweenRows;
// In this particular case, we slightly increase the minimum distance between rows
// to make sure cells with wrapped text are not mistaken for two cells:
tableExtrctOpt.GetMinimumDistanceBetweenRows = (list) =>
{
    var res = GetMinimumDistanceBetweenRows(list);
    return res * 1.2f;
};
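The snippet above follows a general pattern: capture the existing delegate in a local variable, then assign a new delegate that calls the captured one and adjusts its result. The sketch below shows that pattern on its own. Because the article does not show the exact delegate type of GetMinimumDistanceBetweenRows, the example uses a generic `Func<T, float>` as an assumption, and `DelegateScaling.Scale` is a hypothetical helper name.

```csharp
using System;

public static class DelegateScaling
{
    // Returns a new function that calls `inner` and multiplies its result by `factor`.
    public static Func<T, float> Scale<T>(Func<T, float> inner, float factor)
    {
        if (inner == null) throw new ArgumentNullException(nameof(inner));
        // `inner` is captured here, so the returned function keeps calling the
        // original implementation even if the caller later reassigns the
        // property or variable the delegate came from.
        return input => inner(input) * factor;
    }
}
```

Capturing the original first matters: a lambda that read the property it is being assigned to would end up calling itself instead of the default implementation.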
Get the PDF’s Table Data
Now that we have our PDF document object initialized and loaded with the proper bounds and options set, we'll finally extract our table data from the PDF through DsPdf's GetTable method, which returns an object of type ITable (or null if no table is found). We'll also add the logic to access each cell in the extracted table by using the ITable interface's GetCell(rowIndex, colIndex) method. We'll utilize ITable's Rows.Count and Cols.Count properties to loop through the extracted table data, skipping the header row and adding the text of each cell to data, a List<List<string>> that must be declared before the loop and holds one inner list per table row:
// Rows collected for the CSV file; each inner list holds one table row:
var data = new List<List<string>>();
for (int i = 0; i < pdfDoc.Pages.Count; ++i)
{
    // Get the table at the specified bounds:
    var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
    if (itable == null)
        continue;
    // Start at row 1 to skip the header row:
    for (int row = 1; row < itable.Rows.Count; ++row)
    {
        // CSV: add the next data row:
        data.Add(new List<string>());
        for (int col = 0; col < itable.Cols.Count; ++col)
        {
            var cell = itable.GetCell(row, col);
            // Missing cells become empty strings; other values are quoted,
            // with any embedded quotes doubled so the CSV stays valid:
            data.Last().Add(cell == null
                ? ""
                : "\"" + cell.Text.Replace("\"", "\"\"") + "\"");
        }
    }
}
Save Extracted PDF Table Data to Another File Type (CSV)
To save the extracted table data from the PDF using C#, first add a reference to the 'System.Text.Encoding.CodePages' NuGet package, which provides the code page encodings used below.
Then, to save the extracted table data, we take the list of rows we built earlier (called 'data' in this example), join each row's values with commas, and use the File class's AppendAllLines method to write them to a new CSV file. Here, filename is a string variable that holds the output path:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // makes code page encodings such as Windows-1252 available
File.Delete(filename);
File.AppendAllLines(
    filename,
    // Skip rows in which every value is empty, then join each row with commas:
    data.Where(l_ => l_.Any(s_ => !string.IsNullOrEmpty(s_))).Select(d_ => string.Join(',', d_)),
    Encoding.GetEncoding(1252));
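As an alternative to quoting values by hand in the extraction loop, the CSV formatting can live in a small helper that is applied to the raw cell text when each line is written. The sketch below follows the common CSV convention (as described in RFC 4180) of quoting only fields that need it and doubling embedded quotes. `CsvFormatting`, `EscapeField`, and `ToCsvLine` are hypothetical names, not part of DsPdf or .NET.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CsvFormatting
{
    // Quotes a field only when it contains a comma, quote, or line break,
    // and doubles any embedded quotes. Null is treated as an empty field.
    public static string EscapeField(string value)
    {
        var text = value ?? string.Empty;
        bool needsQuotes = text.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes
            ? "\"" + text.Replace("\"", "\"\"") + "\""
            : text;
    }

    // Joins the escaped fields of one row into a single CSV line.
    public static string ToCsvLine(IEnumerable<string> fields) =>
        string.Join(",", fields.Select(EscapeField));
}
```

If you use a helper like this, store the unquoted cell text in the row lists; otherwise the values would be quoted twice.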
The data will now be available in a CSV file:
Original PDF | Extracted PDF Table Data in CSV File |
BONUS: Extract Data from a Multi-Page Table Using C#
What if the table we want to extract is spread across multiple pages instead of being contained in a single page? Don't worry, DsPdf has you covered! Overall, the code is largely the same as above, except that it uses a different PDF document (one that contains multiple pages), adjusts the table bounds for the first page, and writes the extracted data to a new PDF in which each block is labeled with the page of the source document it came from. See the complete code below:
// Requires using System.Drawing; and using GrapeCity.Documents.Text;
// in addition to the GrapeCity.Documents.Pdf namespaces.
void ExtractMultiPageTableData()
{
    const float DPI = 72;
    const float margin = 36;
    var doc = new GcPdfDocument();
    var tf = new TextFormat()
    {
        FontSize = 9,
        ForeColor = Color.Black
    };
    var tfHdr = new TextFormat(tf)
    {
        FontSize = 11,
        ForeColor = Color.DarkBlue
    };
    var tfRed = new TextFormat(tf) { ForeColor = Color.Red };
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "product-list.pdf")))
    {
        var page = doc.NewPage();
        page.Landscape = true;
        var g = page.Graphics;
        var tl = g.CreateTextLayout();
        tl.MaxWidth = page.Bounds.Width;
        tl.MaxHeight = page.Bounds.Height;
        tl.MarginAll = margin;
        tl.DefaultTabStops = 165;
        var docSrc = new GcPdfDocument();
        docSrc.Load(fs);
        for (int i = 0; i < docSrc.Pages.Count; ++i)
        {
            // TableExtractOptions allow you to fine-tune table recognition accounting for
            // specifics of the table formatting:
            var teo = new TableExtractOptions();
            var GetMinimumDistanceBetweenRows = teo.GetMinimumDistanceBetweenRows;
            // In this particular case, we slightly increase the minimum distance between rows
            // to make sure cells with wrapped text are not mistaken for two cells:
            teo.GetMinimumDistanceBetweenRows = (list) =>
            {
                var res = GetMinimumDistanceBetweenRows(list);
                return res * 1.2f;
            };
            // The table starts lower on the first page than on the following pages:
            var top = i == 0 ? DPI * 2 : DPI;
            // Get the table at the specified bounds:
            var itable = docSrc.Pages[i].GetTable(new RectangleF(DPI * 0.25f, top, DPI * 8, DPI * 10.5f - top), teo);
            // Skip pages where no table was recognized:
            if (itable == null)
                continue;
            // Add table data to the text layout:
            tl.Append($"\nTable on page {i + 1} of the source document has {itable.Cols.Count} column(s) and {itable.Rows.Count} row(s), table data is:", tfHdr);
            tl.AppendParagraphBreak();
            for (int row = 0; row < itable.Rows.Count; ++row)
            {
                var tfmt = row == 0 ? tfHdr : tf;
                for (int col = 0; col < itable.Cols.Count; ++col)
                {
                    var cell = itable.GetCell(row, col);
                    if (col > 0)
                        tl.Append("\t", tfmt);
                    if (cell == null)
                        tl.Append("<no cell>", tfRed);
                    else
                        tl.Append(cell.Text, tfmt);
                }
                tl.AppendLine();
            }
        }
        // Print the extracted data, splitting it across pages as needed:
        TextSplitOptions to = new TextSplitOptions(tl) { RestMarginTop = margin, MinLinesInFirstParagraph = 2, MinLinesInLastParagraph = 2 };
        tl.PerformLayout(true);
        while (true)
        {
            var splitResult = tl.Split(to, out TextLayout rest);
            doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
            if (splitResult != SplitResult.Split)
                break;
            tl = rest;
            doc.NewPage().Landscape = true;
        }
        doc.Save("ExtractMultiPageTableData.pdf");
    }
}
This saves a new PDF, ExtractMultiPageTableData.pdf, that lists the extracted table data for each page of the source document.
You can download the full sample for the above code from our demo page.
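If you would rather collect a multi-page table into a single data set (for example, to write one CSV file) than print it to a new PDF, the per-page rows can be merged once they have been read into lists. The sketch below assumes that each page repeats the same header row, which may not be true for your document. `MultiPageTables.Merge` is a hypothetical helper that works on plain string lists rather than on DsPdf types.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class MultiPageTables
{
    // Concatenates the rows of all pages. The very first row is treated as the
    // header and kept once; any later row identical to it is assumed to be a
    // repeated header and is skipped.
    public static List<List<string>> Merge(IEnumerable<List<List<string>>> pages)
    {
        var merged = new List<List<string>>();
        List<string> header = null;
        foreach (var page in pages)
        {
            foreach (var row in page)
            {
                if (header == null)
                {
                    header = row;
                    merged.Add(row);
                }
                else if (!row.SequenceEqual(header))
                {
                    merged.Add(row);
                }
            }
        }
        return merged;
    }
}
```

Note that a data row that happens to match the header exactly would also be dropped, so check this assumption against your own tables.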
BONUS: Format the Exported PDF Table Data in an Excel (XLSX) File
Would you prefer an XLSX file instead of a CSV file? No problem! Document Solutions for Excel (DsExcel) can also help with that.
To get started, add the 'DS.Documents.Excel' NuGet package to your project and add the following using statement:
using GrapeCity.Documents.Excel;
Next, we’ll initialize a DsExcel workbook instance and load the CSV file into it using the workbook’s Open method:
var workbook = new GrapeCity.Documents.Excel.Workbook();
// filename is the path of the CSV file written in the previous step:
workbook.Open(filename, OpenFileFormat.Csv);
Now, we select the range that holds the extracted data (A1:E10 for this sample), wrap the cell text, apply auto-sizing to the columns, align the cell content, and apply styling with conditional back colors:
IWorksheet worksheet = workbook.Worksheets[0];
IRange range = worksheet.Range["A2:E10"];
// wrapping cell content
range.WrapText = true;
// styling column names
worksheet.Range["A1"].EntireRow.Font.Bold = true;
// auto-sizing range
worksheet.Range["A1:E10"].AutoFit();
// aligning cell content
worksheet.Range["A1:E10"].HorizontalAlignment = HorizontalAlignment.Center;
worksheet.Range["A1:E10"].VerticalAlignment = VerticalAlignment.Center;
// applying conditional format on UnitPrice
IColorScale twoColorScaleRule = worksheet.Range["E2:E10"].FormatConditions.AddColorScale(ColorScaleType.TwoColorScale);
twoColorScaleRule.ColorScaleCriteria[0].Type = ConditionValueTypes.LowestValue;
twoColorScaleRule.ColorScaleCriteria[0].FormatColor.Color = Color.FromArgb(255, 229, 229);
twoColorScaleRule.ColorScaleCriteria[1].Type = ConditionValueTypes.HighestValue;
twoColorScaleRule.ColorScaleCriteria[1].FormatColor.Color = Color.FromArgb(255, 20, 20);
Thread.Sleep(1000);
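The range addresses above are hard-coded for this sample's table. If the number of rows or columns can vary, the A1-style address can be computed from the size of the data instead. The following sketch shows one way to do that; `RangeAddress`, `ColumnName`, and `ForData` are hypothetical helper names and not part of DsExcel.

```csharp
using System;

public static class RangeAddress
{
    // Converts a 1-based column number to letters: 1 -> A, 26 -> Z, 27 -> AA.
    public static string ColumnName(int column)
    {
        if (column < 1) throw new ArgumentOutOfRangeException(nameof(column));
        var name = string.Empty;
        while (column > 0)
        {
            int remainder = (column - 1) % 26;
            name = (char)('A' + remainder) + name;
            column = (column - 1) / 26;
        }
        return name;
    }

    // Builds the address of a block that starts in column A at `firstRow`
    // and spans `rowCount` rows and `columnCount` columns.
    public static string ForData(int firstRow, int rowCount, int columnCount) =>
        $"A{firstRow}:{ColumnName(columnCount)}{firstRow + rowCount - 1}";
}
```

For example, `RangeAddress.ForData(2, 9, 5)` returns "A2:E10", the data range used above, and the result can be passed to the worksheet's Range indexer.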
Lastly, we need to save the workbook as an Excel file using the workbook object’s Save method:
workbook.Save("ExtractedData_Formatted.xlsx");
With this, developers can use C# and DsPdf to programmatically extract PDF table data to another file format, such as CSV. Then, through the use of DsExcel, that PDF data can easily be converted to a stylized and formatted Excel XLSX file for easy data analysis:
Original PDF | Extracted PDF Table Data in CSV File | Formatted Excel XLSX File |
Document Solutions .NET PDF API Library
This article only scratches the surface of the full capabilities of Document Solutions for PDF. You can review our documentation to see the many available features or view our demos to see the features in action, along with downloadable sample projects. To learn more about Document Solutions for PDF and the latest new features available, check out our releases page.