public class PDFMarkedContentExtractor extends PDFStreamEngine
Constructor and Description |
---|
PDFMarkedContentExtractor(): Instantiate a new PDFMarkedContentExtractor object. |
PDFMarkedContentExtractor(String encoding): Constructor. |
Modifier and Type | Method and Description |
---|---|
void | beginMarkedContentSequence(COSName tag, COSDictionary properties): Called when a marked content group begins. |
protected float | computeFontHeight(PDFont font): Compute the font height. |
void | endMarkedContentSequence(): Called when a marked content group ends. |
List<PDMarkedContent> | getMarkedContents() |
boolean | isSuppressDuplicateOverlappingText() |
void | processPage(PDPage page): This will initialize and process the contents of the stream. |
protected void | processTextPosition(TextPosition text): This will process a TextPosition object and add the text to the list of characters on a page. |
void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText): By default, the class attempts to remove duplicate text that overlaps. |
protected void | showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement): Called when a glyph is to be processed. |
void | xobject(PDXObject xobject) |
Methods inherited from class PDFStreamEngine:
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
public PDFMarkedContentExtractor() throws IOException
Throws:
IOException
public PDFMarkedContentExtractor(String encoding) throws IOException
Parameters:
encoding - The encoding that the output will be written in.
Throws:
IOException
public boolean isSuppressDuplicateOverlappingText()
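The flag returned here controls whether overlapping duplicate text is dropped (some producers paint the same character more than once at nearly the same position). The idea can be illustrated with a minimal, library-free sketch. Everything in it is an assumption made for illustration: `Glyph` is a hypothetical stand-in for `TextPosition`, `suppressDuplicates` is not a PDFBox method, and the tolerance of one third of the glyph width is an arbitrary choice rather than a documented constant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateOverlapSketch {
    /** Hypothetical stand-in for TextPosition: text plus position and width. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Keeps a glyph only if no identical glyph was already kept at nearly the same position. */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs) {
        Map<String, List<Glyph>> seenByText = new HashMap<>();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            List<Glyph> same = seenByText.computeIfAbsent(g.text, k -> new ArrayList<>());
            float tolerance = g.width / 3.0f; // assumed tolerance, for illustration only
            boolean duplicate = false;
            for (Glyph other : same) {
                if (Math.abs(other.x - g.x) < tolerance && Math.abs(other.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                same.add(g);
                kept.add(g);
            }
        }
        return kept;
    }

    /** Builds a small page where the second "A" overlaps the first, then joins the kept text. */
    static String demo() {
        List<Glyph> page = Arrays.asList(
                new Glyph("A", 10.0f, 10.0f, 6.0f),
                new Glyph("A", 10.5f, 10.2f, 6.0f), // overlaps the first "A"
                new Glyph("B", 16.0f, 10.0f, 6.0f),
                new Glyph("A", 40.0f, 10.0f, 6.0f)); // same text, far away: kept
        StringBuilder sb = new StringBuilder();
        for (Glyph g : suppressDuplicates(page)) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "ABA"
    }
}
```

With the real class, the equivalent switch would be `setSuppressDuplicateOverlappingText(false)` to keep every painted glyph.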
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
Parameters:
suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
Overrides:
beginMarkedContentSequence in class PDFStreamEngine
Parameters:
tag - indicates the role or significance of the sequence
properties - optional properties
public void endMarkedContentSequence()
Overrides:
endMarkedContentSequence in class PDFStreamEngine
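Because marked content groups can nest, paired begin/end callbacks like the two above are commonly handled with a stack of open groups. The following is a minimal sketch of that pattern, not PDFBox's implementation: `Node` is a hypothetical stand-in for `PDMarkedContent`, tags are plain strings instead of `COSName`, and the `begin`, `end`, and `render` names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class MarkedContentNestingSketch {
    /** Hypothetical stand-in for PDMarkedContent: a tag plus nested groups. */
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    private final List<Node> roots = new ArrayList<>(); // top-level groups
    private final Deque<Node> open = new ArrayDeque<>(); // groups not yet closed

    /** A group begins: attach it to the innermost open group, or to the top level. */
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    /** A group ends: close the innermost open group, ignoring unbalanced ends. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<Node> roots() {
        return roots;
    }

    /** Renders the tree as text, with children in parentheses. */
    static String render(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            sb.append(n.tag);
            if (!n.children.isEmpty()) {
                sb.append('(').append(render(n.children)).append(')');
            }
            sb.append(' ');
        }
        return sb.toString().trim();
    }

    /** Simulates a stream with a Span nested in a P, followed by a Figure. */
    static String demo() {
        MarkedContentNestingSketch sketch = new MarkedContentNestingSketch();
        sketch.begin("P");
        sketch.begin("Span");
        sketch.end();
        sketch.end();
        sketch.begin("Figure");
        sketch.end();
        return render(sketch.roots());
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "P(Span) Figure"
    }
}
```

In this sketch only top-level groups appear in `roots()`; nested groups are reached through their parent.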
public void xobject(PDXObject xobject)
protected void processTextPosition(TextPosition text)
Parameters:
text - The text to process.
public List<PDMarkedContent> getMarkedContents()
public void processPage(PDPage page) throws IOException
Overrides:
processPage in class PDFStreamEngine
Parameters:
page - the page to process
Throws:
IOException - if there is an error accessing the stream.
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
Overrides:
showGlyph in class PDFStreamEngine
Parameters:
textRenderingMatrix - the current text rendering matrix, Trm
font - the current font
code - internal PDF character code for the glyph
unicode - the Unicode text for this glyph, or null if the PDF does not provide it
displacement - the displacement (i.e. advance) of the glyph in text space
Throws:
IOException - if the glyph cannot be processed
protected float computeFontHeight(PDFont font) throws IOException
Parameters:
font - the font.
Throws:
IOException - if there is an error while getting the font bounding box.
Copyright © 2002–2023 The Apache Software Foundation. All rights reserved.