Pdfbox

A BaseX interface for version 3 of the Apache PDFBox library.

The Apache PDFBox® library is an open source Java tool for working with PDF documents. PDFBox allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents.

This interface is packaged in the EXPath format. The package includes the required PDFBox jars. A test suite is available, and workflow actions run it on BaseX 10.7 and 11.7.

Note

Currently (v0.1.5) the package works with BaseX 9.7, but this may change in future versions.

Features

The features focus on extracting information from PDFs rather than creation or editing of PDFs.

Supported:

  • Read the PDF page count.
  • Read any PDF outline and return it as map(s) or XML.
  • Read page labels.
  • Read page text.
  • Save a PDF page range to a new PDF.
  • Save an image of a rendered PDF page.
  • Open a PDF with a password.
  • Support for xs:base64Binary in function inputs and outputs to facilitate database and store usage.

Not supported:

  • Creating completely new PDFs
  • Page size information
  • XMP processing
  • Form processing

Documentation

Install

Pre-built pdfbox-x.y.z.xar files are available on the releases page. They can be installed using the standard repository functions or using the GUI.
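
As a minimal sketch of installing with the repository functions, the query below uses `repo:install` from the BaseX Repository Module. The file path is a placeholder, not a real location; substitute the path of the package file downloaded from the releases page.

```xquery
(: Install the downloaded package into the BaseX repository.
   Assumption: the path is a placeholder for the file you downloaded. :)
repo:install("path/to/pdfbox-x.y.z.xar")
```

Running `repo:list()` as a separate query afterwards should show the package among the installed packages.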

Usage

import module namespace pdfbox="org.expkg_zone58.Pdfbox3";

pdfbox:with-pdf("...path/to/pdf.pdf",
 function($pdf){
  (1 to pdfbox:page-count($pdf))!pdfbox:page-text($pdf,.)
 }
)
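
A variation on the example above, as a sketch only: it reports the number of characters of text found on each page. It uses only the three functions shown above and assumes that `pdfbox:page-text` returns a string and that `pdfbox:with-pdf` returns the result of the supplied function. The path is the same placeholder.

```xquery
import module namespace pdfbox="org.expkg_zone58.Pdfbox3";

(: Placeholder path, as in the example above. :)
pdfbox:with-pdf("...path/to/pdf.pdf",
  function($pdf){
    (: One map per page: the page number and the length of its text. :)
    for $n in 1 to pdfbox:page-count($pdf)
    return map {
      "page": $n,
      "chars": string-length(pdfbox:page-text($pdf, $n))
    }
  }
)
```

Pages that report zero characters are often scanned images with no text layer.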

Build

  • scripts/make-xar.xq packages the required jars and xqm files into a xar file in the dist folder.

The package.json is used/abused as a configuration source. Non-standard information is held in the expkg_zone58 section. This is experimental and may change.

Action support

The workflow ci-basex.yaml builds and tests the package. It can be used as an action on GitHub or on a local Gitea installation.