# Pdfbox
A `BaseX` interface for version 3 of the `Apache PDFBox` library.

The [Apache PDFBox® library](https://pdfbox.apache.org/) is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.

This interface is packaged in the [EXPath](https://docs.basex.org/main/Repository#expath_packaging) format.
A test suite is available, and workflow actions run it on BaseX 10.7 and 11.7.

> [!NOTE]  
> Currently (v0.1.5) the package works with BaseX 9.7, but this may change in future versions.

* The Apache PDFBox 3 [FAQ](https://pdfbox.apache.org/3.0/faq.html) may be useful.

## Features

The features focus on extracting information from PDFs rather than creation or editing.

* read the PDF page count.
* read any PDF outline and return it as map(s) or XML.
* read page labels.
* read page text.
* save a PDF page range to a new PDF.
* save an image of a rendered PDF page.


## Install
Pre-built `pdfbox-x.y.z.xar` files are available on the releases page. They can be installed using the standard repository functions or using the GUI.
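
As a minimal sketch of the repository-function route, the query below calls `repo:install` from the BaseX Repository Module. The file path is a hypothetical placeholder, not a location defined by this project; point it at wherever the release file was downloaded.

```xquery
(: Hypothetical path: replace with the location of the downloaded release file :)
declare variable $package as xs:string := "/path/to/pdfbox-x.y.z.xar";

(: Install the package into the BaseX repository :)
repo:install($package)
```

Running `repo:list()` as a separate query afterwards lists the installed packages, which is one way to check that the installation succeeded.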

## Usage
```xquery
import module namespace pdfbox="org.expkg_zone58.Pdfbox3";

(: return the text of every page, in page order :)
pdfbox:with-pdf("...path/to/pdf.pdf",
  function($pdf){
    (1 to pdfbox:page-count($pdf)) ! pdfbox:page-text($pdf, .)
  }
)
```

## Build

* `scripts/make-xar.xq` packages the required `jar`s and `xqm` files to a `xar` file in the `dist` folder.

### Action support

The workflow `ci-basex.yaml` builds and tests the package. It can be used as an action on [GitHub](https://github.com/features/actions), or on a local [Gitea](https://docs.gitea.com/usage/actions/overview) installation.