
document

The document.* commands operate on a concrete local path or URL. Use them when you already have a filing page, PDF, investor deck, exhibit, or extracted document URL and need structured text, match offsets, table previews, or OCR fallback.

All document commands return the standard Finance CLI JSON envelope:

{
  "ok": true,
  "data": {},
  "error": null,
  "warnings": []
}
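A minimal Python sketch of consuming this envelope, assuming the JSON has already been captured as a string. `FinanceCliError` and `unwrap` are hypothetical helper names used for illustration; they are not part of Finance CLI.

```python
import json


class FinanceCliError(RuntimeError):
    """Hypothetical exception raised when the envelope reports ok=false."""


def unwrap(raw: str) -> tuple[dict, list[str]]:
    """Return (data, warnings) from an envelope, or raise if ok is false."""
    envelope = json.loads(raw)
    if not envelope.get("ok"):
        # error is documented as a human-readable string when ok is false.
        raise FinanceCliError(envelope.get("error") or "unknown error")
    # data is documented as null only when ok is false; guard anyway.
    return envelope.get("data") or {}, list(envelope.get("warnings") or [])


# The empty envelope shown above.
sample = '{"ok": true, "data": {}, "error": null, "warnings": []}'
data, warnings = unwrap(sample)
```

Checking `ok` before touching `data` keeps the failure path in one place, and returning the warnings alongside the payload makes it harder to ignore them by accident.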

Extract native text and layout blocks from a PDF or HTML document.

finance document.read reads a local path or URL and returns extracted text, page metadata, block offsets, truncation metadata, and parser warnings. For HTML inputs it uses BeautifulSoup. For PDF inputs it uses the native PDF parser and does not run OCR.

Use this command for the first pass over a known document. It is the right command when you need the raw extracted text and stable offsets before choosing whether to scan, window, parse tables, or run OCR.

finance document.read SOURCE|source=PATH_OR_URL [format=pdf|html max_chars=12000 max_pages=5]
| Argument | Required | Default | Accepted values | Description |
|---|---|---|---|---|
| SOURCE | Yes, unless source, path, or url is set | None | Local path or URL | Positional document source. |
| source | Yes, unless positional SOURCE, path, or url is set | None | Local path or URL | Keyword document source. |
| path | Yes, unless another source form is set | None | Local filesystem path | Alias for source. |
| url | Yes, unless another source form is set | None | HTTP(S) URL | Alias for source. |
| format | No | Auto-detected | pdf, html | Parser override. Use html for SEC filing pages and pdf for PDFs. |
| max_chars | No | 12000 | Integer; 0 means unbounded in current readers | Maximum extracted text characters returned. |
| max_pages | No | All pages | Integer; 0 means all pages | Maximum PDF pages to process. Ignored for single-page HTML extraction. |
finance document.read ./filing.html format=html max_chars=4000 --output json
finance document.read ./deck.pdf max_pages=3 --output json
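When the command is driven from a script, the key=value arguments can be assembled programmatically. The following is a sketch only: `build_argv` is a hypothetical helper, and it assumes the `finance` executable is on the PATH and that `--output json` is wanted, as in the examples above.

```python
def build_argv(command: str, source: str, **options: object) -> list[str]:
    """Build an argv list using the key=value syntax shown above."""
    argv = ["finance", command, source]
    # Options are rendered as key=value tokens, in the order given.
    argv += [f"{key}={value}" for key, value in options.items()]
    # Every example in this document requests JSON output.
    argv += ["--output", "json"]
    return argv


argv = build_argv("document.read", "./filing.html", format="html", max_chars=4000)
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)`; passing a list rather than a shell string avoids quoting problems with paths and queries that contain spaces.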

This output was generated with finance document.read /tmp/financecli_sample.html format=html max_chars=1000 --output json.

{
  "ok": true,
  "data": {
    "source": "/tmp/financecli_sample.html",
    "url": "/tmp/financecli_sample.html",
    "engine": "beautifulsoup",
    "format": "html",
    "text": "Sample filing\nTotal current assets were 100.\nOperating lease costs were 12.\nMetric\nValue\nRevenue\n123",
    "pages": [
      {
        "page": 1,
        "text": "Sample filing\nTotal current assets were 100.\nOperating lease costs were 12.\nMetric\nValue\nRevenue\n123",
        "char_count": 100,
        "returned_chars": 100,
        "truncated": false,
        "blocks": [
          {
            "index": 0,
            "text": "Sample filing\nTotal current assets were 100.\nOperating lease costs were 12.\nMetric\nValue\nRevenue\n123",
            "start_char": 0,
            "end_char": 100,
            "bbox": null
          }
        ]
      }
    ],
    "char_count": 100,
    "returned_chars": 100,
    "truncated": false,
    "warnings": [
      "HTML extraction returned very short text; document may be a wrapper or script-rendered."
    ]
  },
  "error": null,
  "warnings": [
    "HTML extraction returned very short text; document may be a wrapper or script-rendered."
  ]
}
| Field | Type | Description |
|---|---|---|
| ok | boolean | Whether the command completed successfully. |
| data | object or null | Command-specific result payload. It is null when ok is false. |
| error | string or null | Human-readable error message when ok is false; otherwise null. |
| warnings | array | Non-fatal warnings returned by the command. |
| data.source | string | Source path or URL passed to the command. |
| data.url | string | URL/source value used by the HTML reader. Present for HTML reads. |
| data.engine | string | Parser engine, such as beautifulsoup or pymupdf. |
| data.format | string | Resolved document format: html or pdf. |
| data.text | string | Extracted text after max_chars truncation. |
| data.pages | array | Page-level extraction records. HTML inputs are returned as page 1. |
| data.pages[].text | string | Page text after per-page extraction. |
| data.pages[].char_count | integer | Total characters found on the page. |
| data.pages[].returned_chars | integer | Characters returned for the page. |
| data.pages[].truncated | boolean | Whether page text was truncated. |
| data.pages[].blocks | array | Offset-bearing text blocks for follow-up matching. |
| data.pages[].blocks[].start_char | integer | Start character offset in the extracted text. |
| data.pages[].blocks[].end_char | integer | End character offset in the extracted text. |
| data.pages[].blocks[].bbox | array or null | Bounding box when the parser exposes layout coordinates. |
| data.char_count | integer | Total extracted character count before truncation. |
| data.returned_chars | integer | Total characters returned in data.text. |
| data.truncated | boolean | Whether data.text was truncated by max_chars. |
| data.warnings | array | Parser-specific warnings. These are also copied to the top-level warnings field. |
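The offset and truncation fields can be used together, as in the Python sketch below. It assumes that block offsets index directly into `data.text`, which holds for the sample output above but is not guaranteed here for truncated reads. `block_text` and `missing_chars` are hypothetical helper names.

```python
def block_text(data: dict, page_index: int, block_index: int) -> str:
    """Slice data.text using a block's start_char/end_char offsets."""
    block = data["pages"][page_index]["blocks"][block_index]
    return data["text"][block["start_char"]:block["end_char"]]


def missing_chars(data: dict) -> int:
    """Characters dropped by max_chars (0 when nothing was truncated)."""
    if not data["truncated"]:
        return 0
    return data["char_count"] - data["returned_chars"]


# Trimmed copy of the data object from the sample output above.
text = ("Sample filing\nTotal current assets were 100.\n"
        "Operating lease costs were 12.\nMetric\nValue\nRevenue\n123")
read_data = {
    "text": text,
    "pages": [{"page": 1, "blocks": [{"index": 0, "start_char": 0, "end_char": 100}]}],
    "char_count": 100,
    "returned_chars": 100,
    "truncated": False,
}
first_block = block_text(read_data, 0, 0)
dropped = missing_chars(read_data)
```

A non-zero `missing_chars` result is a signal to re-run with a larger `max_chars` (or `max_chars=0`) before relying on offsets near the end of the document.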

Search extracted document text for topics or literal phrases and return match offsets.

finance document.scan reads the document, searches the extracted text using deterministic matching, and returns matched blocks with match_id, score, page, character offsets, and optional surrounding context.

The command supports fuzzy matching and an all_terms mode for table-style queries where every meaningful query term should be present.

Use this command when you need evidence discovery inside a filing, PDF, or HTML page. It is the command to run before finance document.window when you need stable offsets for follow-up reading.

finance document.scan SOURCE|source=PATH_OR_URL [query=TEXT topics=TOPICS format=pdf|html match=fuzzy|all_terms threshold=80 max_chars=12000 max_pages=5 limit=50 window=0 start_char=0 end_char=0]
| Argument | Required | Default | Accepted values | Description |
|---|---|---|---|---|
| SOURCE | Yes, unless source, path, or url is set | None | Local path or URL | Positional document source. |
| source / path / url | Yes, unless positional SOURCE is set | None | Local path or URL | Keyword source forms. |
| query | No | None | Text | Literal query. When set, it becomes the only scan topic. |
| topics | No | Built-in default topics | Comma-separated topic names or literal queries | Topic list. Known topics include disclosure, risk, financial_reporting, portfolio, and guidance. Unknown topics are treated as literal queries. |
| topic | No | Same as topics | Comma-separated topic names or literal queries | Alias for topics. |
| format | No | Auto-detected | pdf, html | Parser override. |
| match | No | fuzzy | fuzzy, all_terms | Match mode. |
| threshold | No | 80.0 | Number | Minimum fuzzy score. Use 100 with match=all_terms for strict term coverage. |
| max_chars | No | 12000 | Integer; 0 means unbounded in current readers | Maximum extracted characters to scan. |
| max_pages | No | All pages | Integer; 0 means all pages | Maximum PDF pages to process. |
| limit | No | 50 | Integer | Maximum matches returned. |
| window | No | 0 | Integer | Adds surrounding text to each match when greater than zero. |
| start_char | No | None | Integer | Restricts scanning to blocks overlapping this start offset. |
| end_char | No | None | Integer | Restricts scanning to blocks overlapping this end offset. |
finance document.scan ./filing.html format=html query="operating lease costs" window=80 max_chars=0 --output json
finance document.scan ./report.pdf topics=risk,financial_reporting max_pages=5 --output json
finance document.scan ./filing.html format=html match=all_terms threshold=100 query="Receivables net Total current assets" max_chars=0 --output json

This output was generated with finance document.scan /tmp/financecli_sample.html format=html query="operating lease costs" window=80 max_chars=0 --output json.

{
  "ok": true,
  "data": {
    "source": "/tmp/financecli_sample.html",
    "engine": "beautifulsoup",
    "format": "html",
    "topics": [
      "operating lease costs"
    ],
    "threshold": 80.0,
    "match_mode": "fuzzy",
    "start_char": null,
    "end_char": null,
    "window_chars": 80,
    "matches": [
      {
        "match_id": "char_0_100",
        "topic": "operating lease costs",
        "score": 100.0,
        "query": "operating lease costs",
        "match_mode": "fuzzy",
        "page": 1,
        "block_index": 0,
        "bbox": null,
        "start_char": 0,
        "end_char": 100,
        "snippet": "Sample filing\nTotal current assets were 100.\nOperating lease costs were 12.\nMetric\nValue\nRevenue\n123",
        "text": "Sample filing\nTotal current assets were 100.\nOperating lease costs were 12.\nMetric\nValue\nRevenue\n123",
        "window_start_char": 0,
        "window_end_char": 100
      }
    ],
    "count": 1,
    "pages_scanned": 1,
    "char_count": 100,
    "warnings": [
      "HTML extraction returned very short text; document may be a wrapper or script-rendered."
    ]
  },
  "error": null,
  "warnings": [
    "HTML extraction returned very short text; document may be a wrapper or script-rendered."
  ]
}
| Field | Type | Description |
|---|---|---|
| ok | boolean | Whether the command completed successfully. |
| data | object or null | Command-specific result payload. It is null when ok is false. |
| error | string or null | Human-readable error message when ok is false; otherwise null. |
| warnings | array | Non-fatal warnings returned by the command. |
| data.source | string | Document path or URL passed to the command. |
| data.engine | string | Parser engine used before matching. |
| data.format | string | Resolved document format. |
| data.topics | array or string | Topics or literal query terms scanned. |
| data.threshold | number | Fuzzy score threshold used. |
| data.match_mode | string | Matching mode: fuzzy or all_terms. |
| data.start_char | integer or null | Lower scan bound when supplied. |
| data.end_char | integer or null | Upper scan bound when supplied. |
| data.window_chars | integer | Context window requested for each match. |
| data.matches | array | Match records. |
| data.matches[].match_id | string | Stable char_START_END identifier that can be passed to document.window. |
| data.matches[].topic | string | Topic or query that produced the match. |
| data.matches[].score | number | Match score. |
| data.matches[].page | integer | Page number. |
| data.matches[].block_index | integer | Matched block index on the page. |
| data.matches[].bbox | array or null | Bounding box when available. |
| data.matches[].start_char | integer | Start offset of the matched block. |
| data.matches[].end_char | integer | End offset of the matched block. |
| data.matches[].text | string | Full matched block text. |
| data.matches[].snippet | string | Context snippet when window is greater than zero. |
| data.matches[].window_start_char | integer | Start offset of the snippet window. |
| data.matches[].window_end_char | integer | End offset of the snippet window. |
| data.count | integer | Number of matches returned. |
| data.pages_scanned | integer | Number of pages scanned after offset filtering. |
| data.char_count | integer | Total extracted characters available to the scanner. |
| data.warnings | array | Parser warnings copied to the top-level warnings field. |
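Because `match_id` encodes its offsets as char_START_END, a scan result can be turned into a follow-up document.window call with a few lines of Python. This is a sketch under stated assumptions: `parse_match_id`, `top_match`, and `window_argv` are hypothetical helpers, and choosing the highest `score` is one possible selection rule, not something the CLI prescribes.

```python
import re
from typing import Optional

MATCH_ID_PATTERN = re.compile(r"^char_(\d+)_(\d+)$")


def parse_match_id(match_id: str) -> tuple[int, int]:
    """Split a char_START_END identifier into integer offsets."""
    found = MATCH_ID_PATTERN.match(match_id)
    if found is None:
        raise ValueError(f"not a char_START_END match id: {match_id!r}")
    return int(found.group(1)), int(found.group(2))


def top_match(matches: list[dict]) -> Optional[dict]:
    """Return the highest-scoring match record, or None when there are none."""
    return max(matches, key=lambda record: record["score"], default=None)


def window_argv(source: str, match: dict, chars: int = 4000) -> list[str]:
    """Build a document.window call that reads the text after a match."""
    return ["finance", "document.window", source,
            f"match_id={match['match_id']}", "direction=next",
            f"chars={chars}", "--output", "json"]


# Trimmed copy of the match record from the sample output above.
matches = [{"match_id": "char_0_100", "score": 100.0, "page": 1}]
best = top_match(matches)
offsets = parse_match_id(best["match_id"])
argv = window_argv("/tmp/financecli_sample.html", best, chars=120)
```

Passing `match_id` through unchanged, rather than recomputing offsets, keeps the follow-up call tied to exactly the block the scanner reported.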

Read a bounded text window around a character offset or a document.scan match ID.

finance document.window re-reads the document, locates a character offset, and returns a bounded text window. The anchor can be a raw start_char value or a match_id returned by finance document.scan.

Use this command after document.scan when a match identifies the right part of a filing but you need nearby text for interpretation, citation, or table continuation.

finance document.window SOURCE|source=PATH_OR_URL [format=pdf|html start_char=0|match_id=char_START_END chars=4000 direction=around|next|previous]
| Argument | Required | Default | Accepted values | Description |
|---|---|---|---|---|
| SOURCE | Yes, unless source, path, or url is set | None | Local path or URL | Positional document source. |
| source / path / url | Yes, unless positional SOURCE is set | None | Local path or URL | Keyword source forms. |
| format | No | Auto-detected | pdf, html | Parser override. |
| start_char | Required unless match_id is set | None | Integer | Character offset used as the anchor. |
| start | Required unless start_char or match_id is set | None | Integer | Alias for start_char. |
| match_id | Required unless start_char or start is set | None | char_START_END | Match ID returned by document.scan. |
| chars | No | 4000 | Integer greater than 0 | Maximum window size. |
| direction | No | around | around, next, previous | Whether to read around the anchor, after the match, or before the anchor. Aliases such as after, forward, prev, before, and back are also accepted. |
finance document.window ./filing.html format=html start_char=0 chars=120 --output json
finance document.window ./filing.html format=html match_id=char_52000_52200 direction=next chars=4000 --output json

This output was generated with finance document.window /tmp/financecli_sample.html format=html start_char=0 chars=120 --output json.

{
  "ok": true,
  "data": {
    "source": "/tmp/financecli_sample.html",
    "engine": "beautifulsoup",
    "format": "html",
    "start_char": 0,
    "end_char": 60,
    "returned_chars": 60,
    "char_count": 100,
    "direction": "around",
    "text": "Sample filing\nTotal current assets were 100.\nOperating lease",
    "warnings": [
      "HTML extraction returned very short text; document may be a wrapper or script-rendered."
    ]
  },
  "error": null,
  "warnings": [
    "HTML extraction returned very short text; document may be a wrapper or script-rendered."
  ]
}
| Field | Type | Description |
|---|---|---|
| ok | boolean | Whether the command completed successfully. |
| data | object or null | Command-specific result payload. It is null when ok is false. |
| error | string or null | Human-readable error message when ok is false; otherwise null. |
| warnings | array | Non-fatal warnings returned by the command. |
| data.source | string | Document path or URL passed to the command. |
| data.engine | string | Parser engine used. |
| data.format | string | Resolved document format. |
| data.start_char | integer | Start offset of the returned window. |
| data.end_char | integer | End offset of the returned window. |
| data.returned_chars | integer | Number of characters returned in data.text. |
| data.char_count | integer | Total extracted document characters. |
| data.direction | string | Direction applied to the anchor. |
| data.text | string | Returned text window. |
| data.warnings | array | Parser warnings copied to the top-level warnings field. |
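In the sample output, a request for chars=120 at start_char=0 returned only 60 characters. One reading that fits this result is a window centred on the anchor and clamped to the document bounds. The Python sketch below implements that reading; it is an inference from a single sample, not the CLI's actual implementation, and `around_window` is a hypothetical name.

```python
def around_window(text: str, anchor: int, chars: int) -> tuple[int, int, str]:
    """Return (start, end, window) for a window centred on anchor.

    Assumption: half of chars is read on each side of the anchor, and the
    window is clamped to the start and end of the extracted text.
    """
    half = chars // 2
    start = max(0, anchor - half)
    end = min(len(text), anchor + half)
    return start, end, text[start:end]


# The 100-character extracted text from the sample document above.
text = ("Sample filing\nTotal current assets were 100.\n"
        "Operating lease costs were 12.\nMetric\nValue\nRevenue\n123")
start, end, window = around_window(text, anchor=0, chars=120)
```

Under this assumption, an anchor near the start of a document loses the "before" half of its window, so direction=next may be the better choice when the anchor is at or near offset 0.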

Extract compact table previews from a text-based PDF.

finance document.tables sends a local or remote PDF to the table extraction stack and returns page, table shape, parsing accuracy, whitespace, row previews, and truncation metadata.

Use this command when document.read or document.scan finds a table-heavy PDF and you need row previews instead of plain text windows. It is for text/vector PDFs; scanned image PDFs usually need OCR first.

finance document.tables SOURCE|source=PATH_OR_URL [pages=1-end flavor=stream|lattice max_tables=20 max_rows=25]
| Argument | Required | Default | Accepted values | Description |
|---|---|---|---|---|
| SOURCE | Yes, unless source, path, or url is set | None | Local path or URL | Positional PDF source. |
| source / path / url | Yes, unless positional SOURCE is set | None | Local path or URL | Keyword source forms. |
| pages | No | 1-end | Camelot page expression such as 1, 1-3, 1-end, or all | Pages passed to the table parser. |
| flavor | No | stream | stream, lattice | stream works on whitespace-separated tables. lattice is for ruled-line tables and may require Ghostscript. |
| max_tables | No | 20 | Integer | Maximum detected tables to return. |
| max_rows | No | 25 | Integer | Maximum preview rows per table. |
finance document.tables ./report.pdf pages=10-12 flavor=stream --output json
finance document.tables ./filing.pdf pages=all max_tables=5 max_rows=10 --output json

This output was generated with finance document.tables /tmp/financecli_table.pdf pages=1 flavor=stream max_tables=3 max_rows=5 --output json.

{
  "ok": true,
  "data": {
    "source": "/tmp/financecli_table.pdf",
    "engine": "camelot",
    "format": "pdf",
    "pages": "1",
    "flavor": "stream",
    "tables": [
      {
        "table": 1,
        "page": "1",
        "shape": [
          4,
          1
        ],
        "accuracy": 100.0,
        "whitespace": 0.0,
        "rows": [
          [
            "Metric Value"
          ],
          [
            "Revenue 123"
          ],
          [
            "Gross Profit 45"
          ],
          [
            "Operating Income 12"
          ]
        ],
        "returned_rows": 4,
        "truncated": false
      }
    ],
    "count": 1,
    "total_detected": 1,
    "warnings": []
  },
  "error": null,
  "warnings": []
}
| Field | Type | Description |
|---|---|---|
| ok | boolean | Whether the command completed successfully. |
| data | object or null | Command-specific result payload. It is null when ok is false. |
| error | string or null | Human-readable error message when ok is false; otherwise null. |
| warnings | array | Non-fatal warnings returned by the command. |
| data.source | string | PDF path or URL passed to the command. |
| data.engine | string | Table extraction engine. |
| data.format | string | Always pdf for this command. |
| data.pages | string | Page expression passed to the parser. |
| data.flavor | string | Parser flavor used. |
| data.tables | array | Table preview rows. |
| data.tables[].table | integer | One-based table index in returned results. |
| data.tables[].page | string | Page where the table was detected. |
| data.tables[].shape | array | Table shape as [rows, columns]. |
| data.tables[].accuracy | number or null | Parser accuracy score when available. |
| data.tables[].whitespace | number or null | Parser whitespace score when available. |
| data.tables[].rows | array | Returned row preview. |
| data.tables[].returned_rows | integer | Number of preview rows returned. |
| data.tables[].truncated | boolean | Whether the row preview was truncated by max_rows. |
| data.count | integer | Number of tables returned. |
| data.total_detected | integer | Total tables detected before max_tables truncation. |
| data.warnings | array | Parser warnings. |
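In the sample output the two-column table came back with shape [4, 1], with each row collapsed into a single cell. The Python sketch below shows one way to detect truncation and such single-column results. `summarize_tables` and `split_label_value` are hypothetical helpers, and splitting on the last space is a heuristic that only suits rows ending in a single value token.

```python
def summarize_tables(data: dict) -> dict:
    """Report what max_tables/max_rows cut off and which tables look collapsed."""
    tables = data["tables"]
    return {
        "returned": data["count"],
        "dropped_by_max_tables": data["total_detected"] - data["count"],
        "truncated_tables": [t["table"] for t in tables if t["truncated"]],
        # shape is [rows, columns]; one column suggests merged cells.
        "single_column": [t["table"] for t in tables if t["shape"][1] == 1],
    }


def split_label_value(row: list[str]) -> tuple[str, str]:
    """Split a one-cell row such as 'Gross Profit 45' on its last space."""
    label, _, value = row[0].rpartition(" ")
    return label, value


# Trimmed copy of the data object from the sample output above.
tables_data = {
    "tables": [{
        "table": 1,
        "shape": [4, 1],
        "rows": [["Metric Value"], ["Revenue 123"],
                 ["Gross Profit 45"], ["Operating Income 12"]],
        "truncated": False,
    }],
    "count": 1,
    "total_detected": 1,
}
summary = summarize_tables(tables_data)
pairs = [split_label_value(row) for row in tables_data["tables"][0]["rows"]]
```

When `single_column` is non-empty, trying `flavor=lattice` or a narrower `pages` range is worth considering before post-processing the rows by hand.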

Run OCR/layout parsing on a PDF or document that native extraction cannot read well.

finance document.ocr runs the default OCR stack and returns extracted text, optional markdown, page-level OCR text, blocks, character counts, truncation metadata, and warnings.

Use this command as a fallback for scanned PDFs, image-heavy investor decks, or documents where document.read returns too little text. For text-based PDFs, prefer document.read, document.scan, or document.tables first.

finance document.ocr SOURCE|source=PATH_OR_URL [max_chars=12000 max_pages=5]
| Argument | Required | Default | Accepted values | Description |
|---|---|---|---|---|
| SOURCE | Yes, unless source, path, or url is set | None | Local path or URL | Positional document source. |
| source / path / url | Yes, unless positional SOURCE is set | None | Local path or URL | Keyword source forms. |
| max_chars | No | 12000 | Integer | Maximum OCR text characters returned. |
| max_pages | No | All pages | Integer; 0 means all pages | Maximum pages to OCR. |
finance document.ocr ./deck.pdf max_pages=3 --output json
finance document.ocr ./deck.pdf max_chars=4000 --output json

This output was generated with finance document.ocr /tmp/financecli_table.pdf max_pages=1 max_chars=1000 --output json.

{
  "ok": true,
  "data": {
    "source": "/tmp/financecli_table.pdf",
    "engine": "paddleocr_pp_structure_v3",
    "format": "pdf",
    "text": "Metric\nValue\nRevenue\n123\nGross Profit 45\nOperating Income 12",
    "markdown": null,
    "pages": [
      {
        "page": 1,
        "text": "Metric\nValue\nRevenue\n123\nGross Profit 45\nOperating Income 12",
        "markdown": "",
        "char_count": 60,
        "returned_chars": 60,
        "truncated": false,
        "blocks": [
          {
            "type": "text",
            "text": "Metric Value Revenue 123Gross Profit 45Operating Income 12"
          }
        ]
      }
    ],
    "char_count": 60,
    "returned_chars": 60,
    "truncated": false,
    "warnings": []
  },
  "error": null,
  "warnings": []
}
| Field | Type | Description |
|---|---|---|
| ok | boolean | Whether the command completed successfully. |
| data | object or null | Command-specific result payload. It is null when ok is false. |
| error | string or null | Human-readable error message when ok is false; otherwise null. |
| warnings | array | Non-fatal warnings returned by the command. |
| data.source | string | Document path or URL passed to the command. |
| data.engine | string | OCR engine, usually paddleocr_pp_structure_v3. |
| data.format | string | Resolved document format. |
| data.text | string | OCR text after truncation. |
| data.markdown | string or null | Markdown output when the OCR stack provides it. |
| data.pages | array | Page-level OCR records. |
| data.pages[].page | integer | One-based page number. |
| data.pages[].text | string | OCR text for the page. |
| data.pages[].markdown | string or null | Page markdown when available. |
| data.pages[].char_count | integer | Page OCR character count before truncation. |
| data.pages[].returned_chars | integer | Characters returned for the page. |
| data.pages[].truncated | boolean | Whether page text was truncated. |
| data.pages[].blocks | array | OCR/layout blocks emitted by the parser. |
| data.char_count | integer | Total OCR character count before truncation. |
| data.returned_chars | integer | Total characters returned. |
| data.truncated | boolean | Whether data.text was truncated. |
| data.warnings | array | OCR warnings. |
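The decision to fall back to OCR can be made from a document.read result, as in the Python sketch below. It is illustrative only: `pages_needing_ocr` is a hypothetical helper, the 50-character cutoff is an arbitrary assumption to tune for your documents, and the page records in the usage example are made-up values rather than real CLI output.

```python
MIN_CHARS_PER_PAGE = 50  # hypothetical cutoff, not a Finance CLI setting


def pages_needing_ocr(read_data: dict, min_chars: int = MIN_CHARS_PER_PAGE) -> list[int]:
    """List page numbers whose native extraction returned very little text."""
    return [
        page["page"]
        for page in read_data["pages"]
        if page["char_count"] < min_chars
    ]


# Made-up document.read data for a three-page PDF with two image-only pages.
read_data = {
    "format": "pdf",
    "pages": [
        {"page": 1, "char_count": 1200},
        {"page": 2, "char_count": 0},
        {"page": 3, "char_count": 12},
    ],
}
sparse_pages = pages_needing_ocr(read_data)
```

A non-empty result suggests running finance document.ocr on the same source; since document.ocr accepts max_pages but no page list in the arguments documented above, the sparse page numbers are mainly useful for deciding how many pages to OCR.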