Pdfbox
A BaseX interface for the Apache Pdfbox library version 3.
The Apache PDFBox® library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.
This interface is packaged in the EXPath format. A test suite is available, and workflow actions run it on BaseX 10.7 and 11.7.
Note
Currently (v0.1.5) works with BaseX 9.7, but this may change with future versions.
- The Apache Pdfbox 3 FAQ may be useful.
Features
The features focus on extracting information from PDFs rather than creation or editing.
- read the PDF page count.
- read any PDF outline and return it as map(s) or XML.
- read page labels.
- read page text.
- save a PDF page range to a new PDF.
- save an image of a rendered PDF page.
AI (Deepseek) generated documentation
Install
Pre-built pdfbox-x.y.z.xar files are available on the releases page. They can be installed using the standard repository functions or using the GUI.
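As a minimal sketch of the repository-function route, the query below uses `repo:install` from the BaseX Repository Module. The path is a hypothetical placeholder; replace it with the location of the release file you downloaded.

```xquery
(: Hypothetical path: point this at the downloaded release file. :)
repo:install("path/to/downloaded-release-file")
```

Running `repo:list()` afterwards should show the installed packages, which is one way to check that the install succeeded.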
Usage
import module namespace pdfbox="org.expkg_zone58.Pdfbox3";
pdfbox:with-pdf("...path/to/pdf.pdf",
function($pdf){
(1 to pdfbox:page-count($pdf))!pdfbox:page-text($pdf,.)
}
)
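The same pattern can be extended to do per-page processing. The sketch below counts whitespace-separated words on each page. It uses only the functions shown above and assumes that `pdfbox:page-text` returns the page text as a single string; the map keys `page` and `words` are illustrative names, not part of the library.

```xquery
import module namespace pdfbox="org.expkg_zone58.Pdfbox3";

pdfbox:with-pdf("...path/to/pdf.pdf",
  function($pdf){
    for $page in 1 to pdfbox:page-count($pdf)
    (: Assumption: page-text yields one string per page. :)
    let $text := pdfbox:page-text($pdf, $page)
    return map {
      "page": $page,                    (: illustrative key :)
      "words": count(tokenize($text))   (: split on whitespace :)
    }
  }
)
```

Doing the work inside the callback keeps all use of `$pdf` within the scope that `pdfbox:with-pdf` provides.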
Build
scripts/make-xar.xq packages the required jar and xqm files into a xar file in the dist folder.
Action support
The workflow ci-basex.yaml builds and tests the package. This can be used as an action on GitHub, or on a local Gitea installation.