# Pdfbox
A `BaseX` interface for version 3 of the Apache PDFBox library.

The [Apache PDFBox® library](https://pdfbox.apache.org/) is an open source Java tool for working with PDF documents. It allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents.

This interface is packaged in the [EXPath](https://docs.basex.org/main/Repository#expath_packaging) format. The package includes the required PDFBox jars.

A test suite is available, and workflow actions run it on BaseX 10.7 and 11.7.

> [!NOTE]
> Currently (v0.1.5) the package works with BaseX 9.7, but this may change in future versions.
## Features
The features focus on extracting information from PDFs rather than creating or editing them.
### Supported
* Read the PDF page count
* Read any PDF outline and return it as map(s) or XML
* Read page labels
* Read page text
* Save a PDF page range to a new PDF
* Save an image of a rendered PDF page
* Open a PDF with a password
* Support `xs:base64Binary` in function inputs and outputs, to facilitate database and store usage
### Not supported
* Creating completely new PDFs
* Page size information
* XMP processing
* Form processing
## Documentation
* Function [documentation](doc.md)
* The Apache PDFBox 3 [FAQ](https://pdfbox.apache.org/3.0/faq.html) may be useful.
## Install
Pre-built `pdfbox-x.y.z.xar` files are available on the [releases](../../releases) page. They can be installed using the standard repository functions or the GUI.
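
For the repository-function route, a minimal sketch using the BaseX Repository Module might look like the following. The file name keeps the `x.y.z` placeholder and assumes the release file has been downloaded to the current directory.

```xquery
(: Install the downloaded package into the BaseX repository.
   The path is a placeholder: substitute the actual release file. :)
repo:install("pdfbox-x.y.z.xar")
```

Running `repo:list()` afterwards should show the installed package.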
## Usage
```xquery
import module namespace pdfbox="org.expkg_zone58.Pdfbox3";

(: Collect the text of each page, from page 1 to the page count. :)
pdfbox:with-pdf("...path/to/pdf.pdf",
  function($pdf){
    (1 to pdfbox:page-count($pdf)) ! pdfbox:page-text($pdf, .)
  }
)
```
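
As a variation on the example above, the sketch below wraps the text of each page in an element that records its page number. It uses only the functions already shown. The `page` element and its `n` attribute are illustrative names, not part of the package, and the sketch assumes that `pdfbox:with-pdf` returns whatever the supplied function returns.

```xquery
import module namespace pdfbox="org.expkg_zone58.Pdfbox3";

pdfbox:with-pdf("...path/to/pdf.pdf",
  function($pdf){
    (: One <page> element per page; the element name is hypothetical. :)
    for $n in 1 to pdfbox:page-count($pdf)
    return <page n="{$n}">{ pdfbox:page-text($pdf, $n) }</page>
  }
)
```

Keeping the page number alongside the text makes it easier to trace extracted content back to its source page when the result is stored in a database.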
## Build
* `scripts/make-xar.xq` packages the required `jar` and `xqm` files into a `xar` file in the `dist` folder.

The `package.json` is used/abused as a configuration source. Non-standard information is held in the `expkg_zone58` section. This is experimental and may change.
### Action support
The workflow `ci-basex.yaml` builds and tests the package. It can be run as an action on [GitHub](https://github.com/features/actions) or on a local [Gitea](https://docs.gitea.com/usage/actions/overview) installation.