How to Read and Extract Table Data from PDF Using C# .NET
| Quick Start Guide | |
|---|---|
| Tutorial Concept | Learn how to extract table data from PDF files in C# .NET using a server-side PDF API library. This tutorial shows developers how to recognize table structures in PDF documents, extract cell data programmatically, export the results to CSV, and optionally format the extracted data in an Excel XLSX file. |
| What You Will Need | .NET 8 or higher, the DS.Documents.Pdf NuGet package, and optionally the DS.Documents.Excel NuGet package if you want to format the extracted PDF table data in an XLSX workbook. |
| Controls Referenced | |
Manually copying table data from PDF files is time-consuming, error-prone, and difficult to scale. For developers building document automation workflows, a better approach is to programmatically detect, extract, and export PDF table data using a .NET PDF API.
Document Solutions for PDF .NET (DsPdf .NET) makes this process easier by providing a C# API for working with PDF content, including table recognition and text extraction. With DsPdf .NET, developers can extract structured table data from PDF documents and save it to formats such as CSV or XLSX for reporting, analysis, and downstream processing.
In this tutorial, you’ll learn how to use C# and Document Solutions for PDF .NET to extract table data from a PDF file, export the results to CSV, handle multi-page tables, and optionally format the extracted data in an Excel workbook using Document Solutions for Excel .NET (DsExcel .NET).
Important Information About Tables in PDF Documents
Before extracting table data from a PDF, it’s important to understand how tables typically exist inside PDF files.
Unlike Excel or Word documents, PDF files do not usually store tables as structured objects with rows, columns, and cells. Instead, what looks like a table is often just a collection of positioned text, lines, and shapes arranged visually on the page.
Because of this, extracting table data from PDFs requires more than simply reading a built-in table object. The API must analyze the page layout, recognize the table region, determine row and column boundaries, and then return the content in a usable structure.
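To make the idea concrete, here is a minimal sketch of one step in that kind of layout analysis: grouping positioned text fragments into rows by their vertical position. It is an illustration only, not how DsPdf .NET is implemented; the fragment data, the `GroupIntoRows` helper, and the tolerance value are all assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical text fragments as a PDF page might hold them: a string plus a position.
var pageFragments = new List<(string Text, float X, float Y)>
{
    ("Qty", 200f, 100f), ("Item", 50f, 100f),
    ("Widget", 50f, 120f), ("3", 200f, 120.4f),
    ("Gadget", 50f, 140f), ("7", 200f, 139.8f),
};

var tableRows = GroupIntoRows(pageFragments, tolerance: 2f);
foreach (var tableRow in tableRows)
    Console.WriteLine(string.Join(" | ", tableRow));

// Puts fragments whose Y coordinate lies within `tolerance` of a row's first
// fragment into that row, then orders each row left to right by X.
static List<List<string>> GroupIntoRows(
    IEnumerable<(string Text, float X, float Y)> fragments, float tolerance)
{
    var grouped = new List<List<(string Text, float X, float Y)>>();
    foreach (var frag in fragments.OrderBy(f => f.Y))
    {
        var current = grouped.LastOrDefault();
        if (current != null && Math.Abs(current[0].Y - frag.Y) <= tolerance)
            current.Add(frag);
        else
            grouped.Add(new List<(string Text, float X, float Y)> { frag });
    }
    return grouped
        .Select(r => r.OrderBy(f => f.X).Select(f => f.Text).ToList())
        .ToList();
}
```

Even this toy version needs a tolerance, because fragments on the same visual row rarely share exactly the same coordinate. Real table recognition also has to find column boundaries and handle wrapped text, which is why a dedicated API is useful.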
Document Solutions for PDF .NET helps solve this problem by providing table extraction capabilities that allow developers to identify table-like content and retrieve it programmatically.
How to Extract Table Data from PDF Documents Programmatically Using C#
- Create a .NET Core Console Application with DsPdf .NET included
- Load a Sample PDF Containing a Data Table
- Define Table Recognition Parameters
- Extract the PDF Table Data with C#
- Save Extracted PDF Table Data to a CSV File
- Extract Data from a Multi-Page PDF Table
- Programmatically Format the Exported PDF Table Data in an Excel XLSX File
If you prefer to review the finished implementation, you can download the completed sample application or view a similar demo with full source code from our demo website.
Download a Trial of our .NET PDF or Excel API Library Today!
Create a .NET Core Console Application with DsPdf .NET Included
To get started, create a new .NET Core Console Application in Visual Studio.
Next, right-click the project’s Dependencies node and select Manage NuGet Packages. In the Browse tab, search for the following package:
DS.Documents.Pdf
Install this .NET PDF API package into your project.

When prompted, accept the license agreement. If you do not accept it, the required DLL files may not load correctly.

After installing the package, open Program.cs and add the following using statements:
using System.Drawing;
using System.Linq;
using System.Text;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.Recognition;
using GrapeCity.Documents.Text;
These namespaces provide access to the DsPdf .NET document object model, table recognition APIs, and supporting functionality needed throughout the tutorial.
Load the Sample PDF that Contains a Data Table
Next, create a GcPdfDocument instance and load the PDF file that contains the table you want to extract.
The following code loads a sample PDF named zugferd-invoice.pdf from a local Resources/PDFs folder:
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    // Initialize GcPdf document object
    var pdfDoc = new GcPdfDocument();
    // Load a PDF document
    pdfDoc.Load(fs);
}
In this example, the PDF contains an invoice table that we want to extract and save in another format.

Define Table Recognition Parameters
After loading the PDF, define the region of the page where the table appears. DsPdf .NET uses this region to focus table recognition on the relevant area of the page.
In the following code, a RectangleF object defines the approximate table bounds. The example also sets a DPI value of 72, which corresponds to the default PDF page resolution and is used for measurement calculations:
const float DPI = 72;
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    // Initialize GcPdf document object
    var pdfDoc = new GcPdfDocument();
    // Load a PDF document
    pdfDoc.Load(fs);
    // The approximate table bounds:
    var tableBounds = new RectangleF(0, 2.5f * DPI, 8.5f * DPI, 3.75f * DPI);
}
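To make the units concrete, the short sketch below converts the same bounds from inches to points. The `InchesToPoints` helper is hypothetical and exists only for this illustration; the only type it relies on is `System.Drawing.RectangleF`, whose constructor takes x, y, width, and height.

```csharp
using System;
using System.Drawing;

const float DPI = 72; // 72 points per inch

// Hypothetical helper for illustration: converts a rectangle given in inches to points.
static RectangleF InchesToPoints(float x, float y, float width, float height) =>
    new RectangleF(x * DPI, y * DPI, width * DPI, height * DPI);

// Same values as the tutorial's tableBounds: x = 0 in, y = 2.5 in, width = 8.5 in, height = 3.75 in.
var bounds = InchesToPoints(0f, 2.5f, 8.5f, 3.75f);
Console.WriteLine($"X={bounds.X}, Y={bounds.Y}, Width={bounds.Width}, Height={bounds.Height}");
```

In points, the region starts at y = 180, is 612 wide (the full width of an 8.5-inch page), and is 270 tall. If your table sits elsewhere on the page, measure it in inches and scale by the same factor.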
Next, use the TableExtractOptions class to fine-tune table recognition. This class allows you to customize how DsPdf .NET detects table structure, including spacing between rows and columns.
In this example, the minimum distance between rows is slightly increased. This helps prevent wrapped text within a single cell from being incorrectly recognized as multiple rows:
// TableExtractOptions allows us to fine-tune table recognition accounting for
// the specifics of the table formatting:
var tableExtrctOpt = new TableExtractOptions();
var GetMinimumDistanceBetweenRows = tableExtrctOpt.GetMinimumDistanceBetweenRows;
// In this particular case, we slightly increase the minimum distance between rows
// to make sure cells with wrapped text are not mistaken for two cells:
tableExtrctOpt.GetMinimumDistanceBetweenRows = (list) =>
{
    var res = GetMinimumDistanceBetweenRows(list);
    return res * 1.2f;
};
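Note that the existing delegate is stored in a local variable before it is replaced. The sketch below shows why that order matters, using a hypothetical `SketchOptions` class as a stand-in; it is not the DsPdf type, and its default rule (the smallest gap in a list) is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var options = new SketchOptions();

// Capture the current delegate first so the replacement can still call it.
var original = options.GetMinimumDistance;
options.GetMinimumDistance = gaps => original(gaps) * 1.2f;

var adjusted = options.GetMinimumDistance(new List<float> { 4f, 5f, 10f });
Console.WriteLine(adjusted);

// Hypothetical stand-in for an options class that exposes a replaceable delegate.
class SketchOptions
{
    // Default rule, chosen only for illustration: the smallest gap in the list.
    public Func<List<float>, float> GetMinimumDistance { get; set; } =
        gaps => gaps.Min();
}
```

If the new lambda called `options.GetMinimumDistance` directly instead of the captured `original`, it would invoke itself and recurse until the stack overflowed. Capturing first keeps the default calculation and only scales its result.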
Programmatically Extract the PDF Table Data
Once the PDF is loaded and the table extraction options are configured, you can extract the table data using the GetTable method.
The GetTable method returns an ITable interface, which provides access to the recognized rows, columns, and cells. You can then loop through the table using Rows.Count, Cols.Count, and GetCell(rowIndex, colIndex).
The following code extracts the table data and stores it in a list that can later be written to a CSV file:
// Holds the extracted rows; each inner list is one CSV record:
var data = new List<List<string>>();
for (int i = 0; i < pdfDoc.Pages.Count; ++i)
{
    // Get the table at the specified bounds:
    var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
    if (itable == null)
        continue;
    // CSV: start at row 1 to skip the header row:
    for (int row = 1; row < itable.Rows.Count; ++row)
    {
        var record = new List<string>();
        for (int col = 0; col < itable.Cols.Count; ++col)
        {
            var cell = itable.GetCell(row, col);
            // A missing cell becomes an empty field; cell text is wrapped in quotes:
            record.Add(cell == null ? "" : $"\"{cell.Text}\"");
        }
        data.Add(record);
    }
}
This code checks each page for a table within the defined bounds. If a table is found, it skips the header row, loops through the remaining rows and their columns, wraps each cell’s text in double quotes, and stores the results for export. Missing cells are stored as empty fields.
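One detail to watch: wrapping the text in quotes protects commas inside a cell, but a cell that itself contains a double quote would still produce a malformed CSV record. A minimal sketch of a stricter approach follows, doubling embedded quotes as described in RFC 4180. The `ToCsvField` helper is hypothetical and not part of DsPdf .NET; unlike the loop above, it also writes a missing cell as a quoted empty field.

```csharp
using System;

// Hypothetical helper: wraps a field in quotes and doubles any embedded quotes.
static string ToCsvField(string text) =>
    "\"" + (text ?? string.Empty).Replace("\"", "\"\"") + "\"";

var field = ToCsvField("12\" monitor, black");
Console.WriteLine(field);
```

In the extraction loop, this could replace the interpolated string, for example `ToCsvField(cell.Text)`, if your source tables may contain quotation marks.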
Save Extracted PDF Table Data to a CSV File
After extracting the table data, you can save it to a CSV file.
First, add a reference to the following NuGet package:
System.Text.Encoding.CodePages

This package is needed to support additional text encodings when writing the CSV file.
Next, register the encoding provider and write the extracted table data to a CSV file using File.AppendAllLines:
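A minimal sketch of this step is shown below. It uses only standard .NET calls (`Encoding.RegisterProvider`, `File.Delete`, `File.AppendAllLines`), but several details are assumptions for illustration: the sample rows stand in for the `data` list built by the extraction loop, the output name `ExtractedData.csv` is hypothetical, and the Windows-1252 code page is just one possible encoding choice.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical stand-in for the rows collected by the extraction loop.
var data = new List<List<string>>
{
    new() { "\"1\"", "\"Widget\"", "\"9.99\"" },
    new() { "", "", "" }, // an all-empty row, skipped below
    new() { "\"2\"", "\"Gadget\"", "\"19.50\"" },
};

// Assumed output path for this sketch.
var fileName = "ExtractedData.csv";

// Makes legacy code pages such as Windows-1252 available to Encoding.GetEncoding.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// AppendAllLines appends, so remove any output left over from a previous run.
File.Delete(fileName);

// Write one comma-separated line per row, skipping rows that have no content.
File.AppendAllLines(
    fileName,
    data.Where(r => r.Any(s => !string.IsNullOrEmpty(s)))
        .Select(r => string.Join(",", r)),
    Encoding.GetEncoding(1252));
```

If your data contains characters outside that code page, writing with `Encoding.UTF8` instead is a reasonable alternative.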
Once the code runs, the extracted PDF table data is available in a CSV file:
| Original PDF | Extracted PDF Table Data in CSV File |
|---|---|
| ![]() | ![]() |
How to Extract Data from a Multi-Page Table Using C#
In some cases, the table you need to extract may span multiple PDF pages. The extraction process is similar, but the code needs to loop through each source page and capture table data from each one.
The following example loads a multi-page PDF named product-list.pdf, extracts table data from each page, and writes the extracted content to a new PDF document. The output identifies which source page each table came from and includes the recognized table data.
// Requires: using System.Drawing; and using GrapeCity.Documents.Text;
void ExtractMultiPageTableData()
{
    const float DPI = 72;
    const float margin = 36;
    var doc = new GcPdfDocument();
    var tf = new TextFormat()
    {
        FontSize = 9,
        ForeColor = Color.Black
    };
    var tfHdr = new TextFormat(tf)
    {
        FontSize = 11,
        ForeColor = Color.DarkBlue
    };
    var tfRed = new TextFormat(tf) { ForeColor = Color.Red };
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "product-list.pdf")))
    {
        var page = doc.NewPage();
        page.Landscape = true;
        var g = page.Graphics;
        var tl = g.CreateTextLayout();
        tl.MaxWidth = page.Bounds.Width;
        tl.MaxHeight = page.Bounds.Height;
        tl.MarginAll = margin;
        tl.DefaultTabStops = 165;
        var docSrc = new GcPdfDocument();
        docSrc.Load(fs);
        for (int i = 0; i < docSrc.Pages.Count; ++i)
        {
            // TableExtractOptions allow you to fine-tune table recognition accounting for
            // specifics of the table formatting:
            var teo = new TableExtractOptions();
            var GetMinimumDistanceBetweenRows = teo.GetMinimumDistanceBetweenRows;
            // In this particular case, we slightly increase the minimum distance between rows
            // to make sure cells with wrapped text are not mistaken for two cells:
            teo.GetMinimumDistanceBetweenRows = (list) =>
            {
                var res = GetMinimumDistanceBetweenRows(list);
                return res * 1.2f;
            };
            // The table starts lower on the first page than on the following pages:
            var top = i == 0 ? DPI * 2 : DPI;
            // Get the table at the specified bounds:
            var itable = docSrc.Pages[i].GetTable(
                new RectangleF(DPI * 0.25f, top, DPI * 8, DPI * 10.5f - top),
                teo);
            // Skip pages where no table is recognized within the bounds:
            if (itable == null)
                continue;
            // Add table data to the text layout:
            tl.Append($"\nTable on page {i + 1} of the source document has {itable.Cols.Count} column(s) and {itable.Rows.Count} row(s), table data is:", tfHdr);
            tl.AppendParagraphBreak();
            for (int row = 0; row < itable.Rows.Count; ++row)
            {
                // The first row uses the header format:
                var tfmt = row == 0 ? tfHdr : tf;
                for (int col = 0; col < itable.Cols.Count; ++col)
                {
                    var cell = itable.GetCell(row, col);
                    if (col > 0)
                        tl.Append("\t", tfmt);
                    if (cell == null)
                        tl.Append("<no cell>", tfRed);
                    else
                        tl.Append(cell.Text, tfmt);
                }
                tl.AppendLine();
            }
        }
        // Print the extracted data, splitting the text across pages as needed:
        TextSplitOptions to = new TextSplitOptions(tl)
        {
            RestMarginTop = margin,
            MinLinesInFirstParagraph = 2,
            MinLinesInLastParagraph = 2
        };
        tl.PerformLayout(true);
        while (true)
        {
            var splitResult = tl.Split(to, out TextLayout rest);
            doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
            if (splitResult != SplitResult.Split)
                break;
            tl = rest;
            doc.NewPage().Landscape = true;
        }
        doc.Save("ExtractMultiPageTableData.pdf");
    }
}
This approach is useful when working with long reports, product catalogs, invoices, statements, or any PDF where tabular data continues across multiple pages.
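If you would rather combine the per-page results into a single table than report them page by page, one option is to merge the rows and drop header rows that repeat on later pages. The sketch below illustrates that idea with plain collections. The sample data and the `MergePages` helper are hypothetical, and the sketch assumes that every page repeats the same header row, which may not hold for your documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-page results: each page is a list of rows, each row an array of cell texts.
var pageTables = new List<List<string[]>>
{
    new() { new[] { "Id", "Name" }, new[] { "1", "Chai" } },
    new() { new[] { "Id", "Name" }, new[] { "2", "Chang" } },
};

var merged = MergePages(pageTables);
foreach (var mergedRow in merged)
    Console.WriteLine(string.Join(",", mergedRow));

// Keeps the first row seen as the header and drops any later row identical to it.
static List<string[]> MergePages(IEnumerable<List<string[]>> tables)
{
    var result = new List<string[]>();
    string[]? header = null;
    foreach (var table in tables)
    {
        foreach (var cells in table)
        {
            if (header == null)
            {
                header = cells;
                result.Add(cells);
            }
            else if (!cells.SequenceEqual(header))
            {
                result.Add(cells);
            }
        }
    }
    return result;
}
```

With DsPdf .NET, the per-page rows would come from reading `cell.Text` for each cell of the recognized table, as in the loops shown earlier.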

The full sample is also available on the Document Solutions demo page, where you can download and test the completed project.
Programmatically Format the Exported PDF Table Data into an Excel XLSX File using C#
CSV is a useful export format, but some workflows require formatted Excel (.xlsx) files instead. If you want to convert the extracted PDF table data into a styled XLSX workbook, you can use Document Solutions for Excel .NET (DsExcel .NET).
Start by installing the following NuGet package:
DS.Documents.Excel

Then add the required namespace:
using GrapeCity.Documents.Excel;
Next, initialize a DsExcel .NET workbook and load the CSV file created from the extracted PDF table data using the Open method. Here, fileName is the path of that CSV file:
var workbook = new GrapeCity.Documents.Excel.Workbook();
workbook.Open(fileName, OpenFileFormat.Csv);
After loading the CSV file, access the worksheet and apply formatting. The following code wraps cell content, bolds the header row, auto-sizes the columns, aligns the cell content, and applies conditional formatting to the UnitPrice column:
IWorksheet worksheet = workbook.Worksheets[0];
IRange range = worksheet.Range["A2:E10"];
// Wrap cell content in the data rows
range.WrapText = true;
// Bold the header row (column names)
worksheet.Range["A1"].EntireRow.Font.Bold = true;
// Auto-size the range
worksheet.Range["A1:E10"].AutoFit();
// Center cell content horizontally and vertically
worksheet.Range["A1:E10"].HorizontalAlignment = HorizontalAlignment.Center;
worksheet.Range["A1:E10"].VerticalAlignment = VerticalAlignment.Center;
// Apply a two-color scale to the UnitPrice column (Color comes from System.Drawing)
IColorScale twoColorScaleRule = worksheet.Range["E2:E10"].FormatConditions.AddColorScale(ColorScaleType.TwoColorScale);
twoColorScaleRule.ColorScaleCriteria[0].Type = ConditionValueTypes.LowestValue;
twoColorScaleRule.ColorScaleCriteria[0].FormatColor.Color = Color.FromArgb(255, 229, 229);
twoColorScaleRule.ColorScaleCriteria[1].Type = ConditionValueTypes.HighestValue;
twoColorScaleRule.ColorScaleCriteria[1].FormatColor.Color = Color.FromArgb(255, 20, 20);
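The range addresses above are hard-coded for this sample's five columns and ten rows. For tables of other sizes, one option is to build the address from the row and column counts. The sketch below shows the arithmetic only; `ColumnName` and `RangeAddress` are hypothetical helpers written for this example, not part of DsExcel .NET.

```csharp
using System;

// Hypothetical helper: converts a 1-based column number to its letter name (1 -> A, 27 -> AA).
static string ColumnName(int column)
{
    var name = string.Empty;
    while (column > 0)
    {
        column--; // shift to 0-based before taking the remainder
        name = (char)('A' + column % 26) + name;
        column /= 26;
    }
    return name;
}

// Builds an address such as "A1:E10" covering the given number of rows and columns.
static string RangeAddress(int rows, int columns) => $"A1:{ColumnName(columns)}{rows}";

var address = RangeAddress(rows: 10, columns: 5);
Console.WriteLine(address); // A1:E10
```

The resulting string could then be passed to the same indexer used above, for example `worksheet.Range[address]`, assuming you track how many rows and columns were written to the CSV file.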
Finally, save the formatted workbook as an XLSX file:
workbook.Save("ExtractedData_Formatted.xlsx");
With DsPdf .NET, developers can programmatically extract table data from a PDF and export it to another format, such as CSV. With DsExcel .NET, that data can then be converted into a styled, formatted Excel XLSX workbook for analysis, reporting, or sharing:
| Original PDF | Extracted PDF Table Data in CSV File | Formatted Excel XLSX File |
|---|---|---|
| ![]() | ![]() | ![]() |
Document Solutions .NET PDF API Library
This tutorial demonstrates one practical way to use Document Solutions for PDF to extract table data from PDF documents in C# .NET. By combining DsPdf .NET’s table recognition capabilities with CSV export and optional Excel formatting through DsExcel .NET, developers can automate workflows that would otherwise require manual data entry or copy-and-paste cleanup.
Document Solutions for PDF .NET includes many additional features for creating, editing, converting, searching, annotating, signing, and securing PDF documents. To continue learning, review the DsPdf .NET documentation, explore the online demos, or visit the Document Solutions release pages to see the latest features available in the .NET PDF API library.




