Extracting Text from ZIP/RAR Archives with GroupDocs.Parser

Introduction

When your business needs to process large volumes of invoices, legal documents, or email exports that arrive as compressed ZIP or RAR files, the traditional approach is to unzip them to disk, open each file with a separate reader, and then delete the temporary files. This route incurs costly I/O overhead, complicates cleanup, and turns nested archives into a nightmare.

GroupDocs.Parser for .NET addresses these problems. The library lets you open an archive directly, enumerate each entry, and extract raw text (and metadata) entirely in memory. In this article you will learn how to:

Install the Parser NuGet package.
Extract text from a flat archive in a single pass.
Recursively traverse nested ZIP/RAR files.
Apply best-practice settings for robust processing.

Why In‑Memory Archive Parsing Matters

Processing archives in memory gives you:

No temporary files – no disk clutter, no leftover files.
Speed – you avoid an extra read/write cycle for each entry.
Scalability – you can handle large archives or cloud streams where a file system may not be available.

Prerequisites

.NET 6.0 or later.
GroupDocs.Parser for .NET (latest version) – see the temporary license for a free evaluation.
A ZIP or RAR archive containing supported documents (PDF, DOCX, TXT, etc.).
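
If you have a temporary license file, it can be applied once at startup, before any Parser is created. The following is a minimal sketch: the helper name LicenseSetup.TryApply and the license path are hypothetical, and it assumes the standard GroupDocs licensing pattern of creating a License object and calling SetLicense with a file path.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

static class LicenseSetup
{
    // Hypothetical helper: applies a license file if it exists.
    // Returns false when the file is missing, in which case the
    // library keeps running in evaluation mode.
    public static bool TryApply(string licensePath)
    {
        if (string.IsNullOrEmpty(licensePath) || !File.Exists(licensePath))
        {
            Console.WriteLine("License file not found; running in evaluation mode.");
            return false;
        }

        var license = new License();
        license.SetLicense(licensePath);
        return true;
    }
}
```

Calling LicenseSetup.TryApply with your own path as the first statement of Main keeps the licensing concern out of the parsing code.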

Installation

dotnet add package GroupDocs.Parser

Add the required namespaces:

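using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions; // UnsupportedDocumentFormatException
using System;                      // Console, StringComparison
using System.Collections.Generic;
using System.IO;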

Step 1 – Open the Archive

The first step is to create a Parser instance that points at the archive file. GetContainer() returns a collection of ContainerItem objects – one per entry inside the archive.

// Path to the archive you want to scan
string archivePath = "./SampleDocs/InvoicesArchive.zip";

using (Parser parser = new Parser(archivePath))
{
    // Retrieve every file (or nested archive) inside the container
    IEnumerable<ContainerItem> attachments = parser.GetContainer();

    // GetContainer() returns null when container extraction is not supported
    if (attachments == null)
    {
        Console.WriteLine("Container extraction is not supported for this file.");
        return;
    }

    // Hand off the collection to a helper that extracts text/metadata
    ExtractDataFromAttachments(attachments);
}

What’s happening:

The Parser constructor loads the archive without extracting it to disk.
GetContainer() lazily reads the archive’s directory and gives you ContainerItem objects you can work with.
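
The same flow should also work when the archive is not on disk at all, because Parser has a constructor that accepts a Stream. The sketch below assumes the archive bytes are already in memory (for example, downloaded from object storage). The class name StreamArchiveExample and the method ListEntries are illustrative and not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

static class StreamArchiveExample
{
    // Lists the entries of an archive held in a byte buffer.
    // `archiveBytes` is a hypothetical buffer supplied by the caller.
    // Returns the number of entries found (0 if containers are unsupported).
    public static int ListEntries(byte[] archiveBytes)
    {
        if (archiveBytes == null) throw new ArgumentNullException(nameof(archiveBytes));

        int count = 0;
        using (var stream = new MemoryStream(archiveBytes))
        using (var parser = new Parser(stream))
        {
            IEnumerable<ContainerItem> items = parser.GetContainer();
            if (items == null)
            {
                return 0;
            }

            foreach (ContainerItem item in items)
            {
                Console.WriteLine(item.FilePath);
                count++;
            }
        }
        return count;
    }
}
```

Because nothing here touches the file system, the same helper could sit behind an HTTP upload handler or a queue consumer.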

Step 2 – Process Each Entry

ExtractDataFromAttachments walks the ContainerItem list, prints basic metadata, detects nested archives, and extracts text from regular documents. The method is completely reusable – call it once for a top‑level archive and again for any nested archive you discover.

/// <summary>
/// Recursively extracts metadata and plain‑text from each item in an archive.
/// </summary>
static void ExtractDataFromAttachments(IEnumerable<ContainerItem> attachments)
{
    foreach (ContainerItem item in attachments)
    {
        // Print a quick line with file path and size (optional)
        Console.WriteLine($"File: {item.FilePath} | Size: {item.Size} bytes");

        try
        {
            // Each ContainerItem can open its own Parser instance
            using (Parser itemParser = item.OpenParser())
            {
                if (itemParser == null)
                {
                    // The item is not a supported document – skip it
                    continue;
                }

                // Detect nested archives by extension (case‑insensitive)
                bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
                                 item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);

                if (isArchive)
                {
                    // Recursively process the inner archive
                    IEnumerable<ContainerItem>? nested = itemParser.GetContainer();
                    if (nested != null)
                    {
                        ExtractDataFromAttachments(nested);
                    }
                }
                else
                {
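                    // Regular document – extract its raw text
                    using (TextReader reader = itemParser.GetText())
                    {
                        // GetText() returns null when text extraction is not supported
                        if (reader == null)
                        {
                            Console.WriteLine($"No text available for {item.FilePath}");
                            continue;
                        }

                        string text = reader.ReadToEnd();
                        Console.WriteLine($"Extracted {text.Length} characters from {item.FilePath}");
                        // Here you could store `text` in a database, index it, etc.
                    }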
                }
            }
        }
        catch (UnsupportedDocumentFormatException)
        {
            // The file type is not supported by GroupDocs.Parser – ignore gracefully
            Console.WriteLine($"Skipping unsupported format: {item.FilePath}");
        }
    }
}
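
ReadToEnd() materializes the whole document text as one string, which can be wasteful for very large entries. As an alternative sketch, the TextReader returned by GetText() can be consumed in fixed-size chunks. TextChunks.ForEachChunk is a hypothetical helper name, and the 4096-character default buffer is an arbitrary choice.

```csharp
using System;
using System.IO;

static class TextChunks
{
    // Reads `reader` in fixed-size chunks and passes each chunk to `onChunk`.
    // Returns the total number of characters read.
    public static long ForEachChunk(TextReader reader, Action<string> onChunk, int bufferSize = 4096)
    {
        if (reader == null) throw new ArgumentNullException(nameof(reader));
        if (onChunk == null) throw new ArgumentNullException(nameof(onChunk));
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException(nameof(bufferSize));

        char[] buffer = new char[bufferSize];
        long total = 0;
        int read;

        // TextReader.Read returns 0 once the end of the text is reached
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            onChunk(new string(buffer, 0, read));
            total += read;
        }
        return total;
    }
}
```

Inside the document branch of ExtractDataFromAttachments, the ReadToEnd() call could then be swapped for TextChunks.ForEachChunk(reader, chunk => { /* send chunk to your sink */ }).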

Key Points

Metadata access – item.Name, item.Size, and the item.Metadata name/value collection describe an entry without reading the file contents.
Recursive handling – The same method calls itself when it encounters another ZIP/RAR, so nested archives are handled to any depth that stack and memory allow.
Error resilience – UnsupportedDocumentFormatException is caught so a single bad file won’t abort the whole run.
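
To illustrate the metadata point, each ContainerItem exposes a Metadata collection of MetadataItem name/value pairs that can be listed without opening a parser for the entry. This is only a sketch: which metadata names appear depends on the container type, and MetadataDump.Print is a hypothetical helper.

```csharp
using System;
using GroupDocs.Parser.Data;

static class MetadataDump
{
    // Prints every metadata name/value pair attached to an archive entry.
    // Returns the number of pairs printed.
    public static int Print(ContainerItem item)
    {
        if (item == null) throw new ArgumentNullException(nameof(item));

        Console.WriteLine($"Entry: {item.FilePath}");

        int count = 0;
        if (item.Metadata == null)
        {
            return count; // nothing to list for this entry
        }

        foreach (MetadataItem metadata in item.Metadata)
        {
            Console.WriteLine($"  {metadata.Name}: {metadata.Value}");
            count++;
        }
        return count;
    }
}
```

Calling MetadataDump.Print(item) at the top of the foreach loop gives a quick inventory of what the archive reports for each entry.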

Step 3 – Putting It All Together

Below is a minimal, copy‑pasteable program that combines the two snippets above. It demonstrates the full flow: open the archive, process each entry, and report the results.

using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Exceptions; // UnsupportedDocumentFormatException
using System;
using System.Collections.Generic;
using System.IO;

class ArchiveTextExtractor
{
    static void Main(string[] args)
    {
        string archivePath = args.Length > 0 ? args[0] : "./SampleDocs/InvoicesArchive.zip";
        using (Parser parser = new Parser(archivePath))
        {
            IEnumerable<ContainerItem> attachments = parser.GetContainer();
            if (attachments == null)
            {
                Console.WriteLine("Container extraction is not supported for this file.");
                return;
            }
            ExtractDataFromAttachments(attachments);
        }
    }

    static void ExtractDataFromAttachments(IEnumerable<ContainerItem> attachments)
    {
        foreach (ContainerItem item in attachments)
        {
            Console.WriteLine($"File: {item.FilePath} | Size: {item.Size} bytes");
            try
            {
                using (Parser itemParser = item.OpenParser())
                {
                    if (itemParser == null) continue;

                    bool isArchive = item.FilePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
                                     item.FilePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);

                    if (isArchive)
                    {
                        var nested = itemParser.GetContainer();
                        if (nested != null) ExtractDataFromAttachments(nested);
                    }
                    else
                    {
                        using (TextReader reader = itemParser.GetText())
                        {
                            // GetText() returns null when text extraction is not supported
                            if (reader == null) continue;

                            string text = reader.ReadToEnd();
                            Console.WriteLine($"Extracted {text.Length} chars from {item.FilePath}");
                        }
                    }
                }
            }
            catch (UnsupportedDocumentFormatException)
            {
                Console.WriteLine($"Unsupported format: {item.FilePath}");
            }
        }
    }
}

Run the program with the path to your archive:

dotnet run -- ./Data/LegalDocs.zip

Best Practices & Tips

Limit parsing work – Parser only extracts what you ask for. If you only need text, call GetText() and skip heavier methods such as GetImages().
Large archives – Process items sequentially as shown; avoid loading all texts into memory at once.
Performance – Skip nested archives you don’t need by checking the file extension before recursing.
Error handling – Always catch UnsupportedDocumentFormatException; many corporate archives contain binaries that the parser cannot read.
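
The extension check and a guard against pathological nesting can be factored into a small pure helper. This is a sketch under assumptions: ArchiveRules, its method names, and the default depth limit of 5 are hypothetical choices and not GroupDocs.Parser settings.

```csharp
using System;

static class ArchiveRules
{
    // True when the path ends with a known archive extension (case-insensitive).
    public static bool IsArchive(string filePath)
    {
        if (string.IsNullOrEmpty(filePath)) return false;

        return filePath.EndsWith(".zip", StringComparison.OrdinalIgnoreCase)
            || filePath.EndsWith(".rar", StringComparison.OrdinalIgnoreCase);
    }

    // Decides whether to descend into a nested archive.
    // `depth` is 0 for entries of the top-level archive.
    public static bool ShouldRecurse(string filePath, int depth, int maxDepth = 5)
    {
        return IsArchive(filePath) && depth < maxDepth;
    }
}
```

To use it, add an int depth parameter to ExtractDataFromAttachments, call ArchiveRules.ShouldRecurse(item.FilePath, depth) before reading the nested container, and pass depth + 1 on the recursive call.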

Conclusion

GroupDocs.Parser for .NET provides a clean, in‑memory way to read every document inside ZIP or RAR archives, even when they are deeply nested. With just a few lines of code you can replace complex unzip‑then‑parse pipelines, reduce I/O load, and build reliable document ingestion services.

Next steps

Explore document comparison or metadata extraction features.
Learn how to extract images from compressed files with the same API.
Feed the extracted text into a search index or an AI pipeline.
