diff --git a/doc.md b/doc.md
index 58fb074..1b20032 100644
--- a/doc.md
+++ b/doc.md
@@ -1,259 +1,205 @@
-# User Documentation for Pdfbox3.xqm XQuery Library
+# User Guide for Pdfbox3.xqm Library (XAR Distribution)
 
-## Overview
+## Introduction
 
-The `Pdfbox3.xqm` library provides an interface to the Apache PDFBox 3.0 library for working with PDF documents in BaseX 10.7+. It allows you to perform various operations on PDF files, such as extracting text, rendering pages to images, extracting metadata, and more.
-## Namespace
+The `Pdfbox3.xqm` library is an XQuery module designed to interface with **Apache PDFBox 3.0**, a powerful Java library for working with PDF documents. This module allows you to perform various operations on PDF files, such as extracting text, rendering pages as images, managing outlines, and more. The library is distributed as a **XAR (XQuery Archive) file**, which includes the necessary PDFBox JAR files, making it easy to install and use in BaseX 10.7+.
 
-The library uses the namespace `org.expkg_zone58.Pdfbox3`.
+---
+
+## Installation
+
+### 1. Download the XAR File
+The library is distributed as a XAR file that includes the required PDFBox JAR files. You can obtain the XAR file from the distribution source (e.g., a repository or a shared location).
+
+### 2. Install the XAR File in BaseX
+To install the XAR file in BaseX, follow these steps:
+
+1. Open the BaseX GUI or command-line interface.
+2. Use the `REPO INSTALL` command to install the XAR file:
+
+   ```xquery
+   REPO INSTALL path/to/pdfbox3.xar
+   ```
+
+   Replace `path/to/pdfbox3.xar` with the actual path to the XAR file.
+
+3. Verify the installation by listing the installed packages:
+
+   ```xquery
+   REPO LIST
+   ```
+
+   You should see `pdfbox3` listed among the installed packages.
+
+---
+
+## Basic Usage
+
+### Importing the Module
+Once the XAR file is installed, you can import the module in your XQuery scripts:
 
 ```xquery
-module namespace pdfbox="org.expkg_zone58.Pdfbox3";
+import module namespace pdfbox="org.expkg_zone58.Pdfbox3";
 ```
 
-## Functions
+---
 
-### `pdfbox:with-pdf($src as xs:string, $fn as function(item()) as item()*) as item()*`
-
-This function opens a PDF file, applies a given function to it, and ensures the PDF is closed after use.
-
-- **Parameters:**
-  - `$src`: The path to the PDF file.
-  - `$fn`: A function that takes a PDF object as input and returns some result.
-
-- **Example:**
-  ```xquery
-  pdfbox:with-pdf("path/to/document.pdf", pdfbox:page-text(?, 5))
-  ```
-
-### `pdfbox:open-file($pdfpath as xs:string) as item()`
-
-Opens a PDF file and returns a PDF object.
-
-- **Parameters:**
-  - `$pdfpath`: The path to the PDF file.
-
-- **Example:**
-  ```xquery
-  let $pdf := pdfbox:open-file("path/to/document.pdf")
-  return pdfbox:page-count($pdf)
-  ```
-
-### `pdfbox:specification($pdf as item()) as xs:string`
-
-Returns the version of the PDF specification used by the document.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:specification($pdf)
-  ```
-
-### `pdfbox:save($pdf as item(), $savepath as xs:string) as xs:string`
-
-Saves the PDF object to the specified file path.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-  - `$savepath`: The path where the PDF should be saved.
-
-- **Example:**
-  ```xquery
-  pdfbox:save($pdf, "path/to/save/document.pdf")
-  ```
-
-### `pdfbox:close($pdf as item()) as empty-sequence()`
-
-Closes the PDF object, releasing resources.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:close($pdf)
-  ```
-
-### `pdfbox:page-count($pdf as item()) as xs:integer`
-
-Returns the number of pages in the PDF.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:page-count($pdf)
-  ```
-
-### `pdfbox:page-image($pdf as item(), $pageNo as xs:integer, $options as map(*)) as xs:base64Binary`
-
-Renders a specific page of the PDF as an image.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-  - `$pageNo`: The page number to render.
-  - `$options`: A map of options, including `format` (e.g., "gif", "png") and `scale`.
-
-- **Example:**
-  ```xquery
-  pdfbox:page-image($pdf, 1, map { "format": "png", "scale": 2 })
-  ```
-
-### `pdfbox:metadata($pdf as item()) as map(*)`
-
-Returns a map containing metadata about the PDF.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:metadata($pdf)
-  ```
-
-### `pdfbox:report($pdfpath as xs:string) as map(*)`
-
-Returns a summary of the PDF, including metadata and page count.
-
-- **Parameters:**
-  - `$pdfpath`: The path to the PDF file.
-
-- **Example:**
-  ```xquery
-  pdfbox:report("path/to/document.pdf")
-  ```
-
-### `pdfbox:hasOutline($pdf as item()) as xs:boolean`
-
-Returns `true` if the PDF has an outline (bookmarks).
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:hasOutline($pdf)
-  ```
-
-### `pdfbox:isEncrypted($pdf as item()) as xs:boolean`
-
-Returns `true` if the PDF is encrypted.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:isEncrypted($pdf)
-  ```
-
-### `pdfbox:outline($pdf as item()) as map(*)*`
-
-Returns the outline (bookmarks) of the PDF as a sequence of maps.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:outline($pdf)
-  ```
-
-### `pdfbox:outline-xml($pdf as item()) as element(outline)?`
-
-Returns the outline (bookmarks) of the PDF as XML.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:outline-xml($pdf)
-  ```
-
-### `pdfbox:extract($pdf as item(), $start as xs:integer, $end as xs:integer, $target as xs:string) as xs:string`
-
-Extracts a range of pages from the PDF and saves them as a new PDF.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-  - `$start`: The starting page number (1-based).
-  - `$end`: The ending page number (1-based).
-  - `$target`: The path to save the new PDF.
-
-- **Example:**
-  ```xquery
-  pdfbox:extract($pdf, 1, 3, "path/to/new/document.pdf")
-  ```
-
-### `pdfbox:labels($pdf as item()) as xs:string*`
-
-Returns the page labels for each page in the PDF.
-
-- **Parameters:**
-  - `$pdf`: A PDF object.
-
-- **Example:**
-  ```xquery
-  pdfbox:labels($pdf)
-  ```
-
-### `pdfbox:page-text($doc as item(), $pageNo as xs:integer) as xs:string`
-
-Returns the text content of a specific page in the PDF.
-
-- **Parameters:**
-  - `$doc`: A PDF object.
-  - `$pageNo`: The page number to extract text from.
-
-- **Example:**
-  ```xquery
-  pdfbox:page-text($pdf, 1)
-  ```
-
-### `pdfbox:version() as xs:string`
-
-Returns the version of the Apache PDFBox library in use.
-
-- **Example:**
-  ```xquery
-  pdfbox:version()
-  ```
-
-## Notes
-
-- The library is designed to work with BaseX 10.7+.
-- Some functions may throw errors if the PDF is encrypted or if the file cannot be opened.
-
-## Examples
-
-### Extracting Text from a PDF Page
+### Opening a PDF Document
+To open a PDF document, use the `pdfbox:open` function. This function can handle local files, URLs, or binary data.
 
 ```xquery
-let $pdf := pdfbox:open-file("path/to/document.pdf")
-return pdfbox:page-text($pdf, 1)
+let $pdf := pdfbox:open("path/to/document.pdf")
 ```
 
-### Rendering a PDF Page as an Image
+If the PDF is encrypted, you can provide a password:
 
 ```xquery
-let $pdf := pdfbox:open-file("path/to/document.pdf")
-return pdfbox:page-image($pdf, 1, map { "format": "png", "scale": 2 })
+let $pdf := pdfbox:open("path/to/encrypted.pdf", map{"password": "your_password"})
 ```
 
-### Extracting Metadata
+---
+
+### Closing a PDF Document
+Always close the PDF document after use to release resources.
 
 ```xquery
-let $pdf := pdfbox:open-file("path/to/document.pdf")
-return pdfbox:metadata($pdf)
+pdfbox:close($pdf)
 ```
 
+---
+
+### Extracting Text from a Page
+To extract text from a specific page, use the `pdfbox:page-text` function.
+
+```xquery
+let $text := pdfbox:page-text($pdf, 1) (: Extract text from page 1 :)
+```
+
+---
+
+### Rendering a Page as an Image
+You can render a PDF page as an image using the `pdfbox:page-image` function. Supported formats include `jpg`, `png`, `bmp`, and `gif`.
+
+```xquery
+let $image := pdfbox:page-image($pdf, 1, map{"format": "png", "scale": 2})
+```
+
+- `format`: The image format (default is `jpg`).
+- `scale`: The scaling factor (default is `1`, which corresponds to 72 DPI).
+
+---
+
 ### Extracting a Range of Pages
+To extract a range of pages from a PDF, use the `pdfbox:extract` function.
 
 ```xquery
-let $pdf := pdfbox:open-file("path/to/document.pdf")
-return pdfbox:extract($pdf, 1, 3, "path/to/new/document.pdf")
+let $extracted := pdfbox:extract($pdf, 1, 3) (: Extract pages 1 to 3 :)
 ```
+
+The result is a new PDF document in binary format.
+
+---
+
+### Getting Document Properties
+You can retrieve various properties of a PDF document, such as the title, author, and creation date.
+
+```xquery
+let $title := pdfbox:property($pdf, "title")
+let $author := pdfbox:property($pdf, "author")
+```
+
+Supported properties include:
+- `pageCount`: Number of pages.
+- `title`: Document title.
+- `author`: Document author.
+- `creator`: Document creator.
+- `producer`: Document producer.
+- `subject`: Document subject.
+- `keywords`: Document keywords.
+- `creationDate`: Document creation date.
+- `modificationDate`: Document modification date.
+
+---
+
+### Working with Outlines (Bookmarks)
+To retrieve the outline (bookmarks) of a PDF, use the `pdfbox:outline` function.
+
+```xquery
+let $outline := pdfbox:outline($pdf)
+```
+
+The outline is returned as a sequence of maps, where each map represents a bookmark with properties like `title`, `index`, and `hasChildren`.
+
+---
+
+### Saving a PDF Document
+To save a PDF document to the filesystem, use the `pdfbox:save` function.
+
+```xquery
+let $savedPath := pdfbox:save($pdf, "path/to/save/document.pdf")
+```
+
+---
+
+## Advanced Usage
+
+### Handling Encrypted PDFs
+If the PDF is encrypted, you can provide a password when opening the document.
+
+```xquery
+let $pdf := pdfbox:open("path/to/encrypted.pdf", map{"password": "your_password"})
+```
+
+---
+
+### Getting Page Labels
+To retrieve page labels (if they exist), use the `pdfbox:labels` function.
+
+```xquery
+let $labels := pdfbox:labels($pdf)
+```
+
+---
+
+### Getting Page Size
+To get the size of a specific page, use the `pdfbox:page-size` function.
+
+```xquery
+let $size := pdfbox:page-size($pdf, 1) (: Get size of page 1 :)
+```
+
+---
+
+### Generating a Report
+You can generate a CSV-style report of properties for multiple PDFs using the `pdfbox:report` function.
+
+```xquery
+let $report := pdfbox:report(("path/to/doc1.pdf", "path/to/doc2.pdf"))
+```
+
+The report includes properties like `title`, `author`, `pageCount`, etc., for each PDF.
+
+---
+
+## Error Handling
+The library includes error handling to manage issues such as failed PDF loads or unsupported operations. Errors are thrown with descriptive messages to help diagnose problems.
+
+```xquery
+try {
+  let $pdf := pdfbox:open("invalid/path.pdf")
+  return pdfbox:page-text($pdf, 1)
+} catch * {
+  fn:error($err:code, $err:description)
+}
+```
+
+---
+
+## Conclusion
+The `Pdfbox3.xqm` library, distributed as a XAR file with included PDFBox JAR files, provides a comprehensive interface for working with PDF documents in XQuery. By leveraging Apache PDFBox, it offers powerful features for text extraction, image rendering, and document manipulation. With this guide, you should be able to integrate PDF processing into your XQuery applications effectively.
+
+For more detailed information, refer to the [Apache PDFBox documentation](https://pdfbox.apache.org/docs/3.0.0/javadocs/) and the [BaseX documentation](https://docs.basex.org/).
+
+---
+
+This user guide provides a starting point for using the `Pdfbox3.xqm` library. For further assistance, consult the official documentation or reach out to the community for support.
\ No newline at end of file
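
Most snippets added by this patch are bare `let` clauses without a `return`, so each one has to be embedded in a full query before it can run. A minimal sketch of one way to combine them follows. It assumes that `pdfbox:open`, `pdfbox:property`, `pdfbox:page-text`, and `pdfbox:close` behave as the guide describes; none of this has been checked against the module itself. The file path is a placeholder, and placing `pdfbox:close` last in the result sequence relies on an unverified assumption that it returns the empty sequence and is evaluated after the other calls.

```xquery
(: Sketch only: function names and arguments are taken from the guide in this
   patch and have not been verified against the installed module. :)
import module namespace pdfbox = "org.expkg_zone58.Pdfbox3";

(: Placeholder path; replace with a real PDF file. :)
let $pdf := pdfbox:open("path/to/document.pdf")
let $summary := map {
  "title": pdfbox:property($pdf, "title"),
  "pages": pdfbox:property($pdf, "pageCount"),
  "first-page-text": pdfbox:page-text($pdf, 1)
}
return (
  $summary,
  (: Assumed to return the empty sequence, so it adds nothing to the result. :)
  pdfbox:close($pdf)
)
```

If the query processor reorders or defers evaluation, the document could be closed before the properties are read; in that case the values would need to be materialized before the close call.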