Wednesday, May 13, 2020

PDFminer - How to use PDFminer and PDFminer Example

What is PDFminer?

When you want to convert a PDF into an HTML or text file that is easy to work with, you may be reluctant to use a conversion service on the web and wonder whether there is an easy way to do it with Python code. If so, I think PDFMiner is where you will eventually arrive. It also ships with code that can be used immediately after installing the module, so as long as you are not too concerned about speed, it is not inconvenient to use. Note that the PDFMiner module is only available for Python 2.

How to use PDFminer

1. PDFminer Install

pip install pdfminer

Or, from inside the downloaded source code folder:

python setup.py install


2. PDFminer Example

Ready-made scripts are provided within the source code. You can convert a PDF document right away, without writing any code, by running a Python script called pdf2txt.py in the tools folder of the source code folder.
pdf2txt.py -o output.html samples/pdf.pdf
-o OUTPUT_FILENAME.html
-o OUTPUT_FILENAME.txt

The extension of the output file name is read automatically, and an html or txt file is created to match it. If you look directly at pdf2txt.py, you'll see how it works.
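The extension-based choice described above can be sketched roughly as follows. This is only an illustration, assuming the output type is picked from the suffix of the name given to -o and that anything unrecognized falls back to plain text. The function name guess_outtype is hypothetical and is not part of PDFMiner.

```python
from __future__ import print_function  # keeps print() consistent on Python 2


def guess_outtype(outfile):
    """Return the output type implied by the output file name.

    Hypothetical helper for illustration; not a PDFMiner function.
    """
    if outfile.endswith('.htm') or outfile.endswith('.html'):
        return 'html'
    if outfile.endswith('.xml'):
        return 'xml'
    # Assumption: every other name, including *.txt, means plain text.
    return 'text'


for name in ('output.html', 'output.txt'):
    print(name, '->', guess_outtype(name))
```

For the exact rules, the pdf2txt.py source itself remains the reference.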

3. How to use PDFminer

pdf2txt.py lets you convert a PDF with a single command. A quick look at the pdf2txt.py source code is enough to write your own Python code that converts a PDF with the settings you want. The following code lets you pick a PDF file in a file browser and converts it directly into pdf2output.txt and pdf2output.html files.
# -*- coding: utf-8 -*-
from tkFileDialog import askopenfilename
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from pdfminer.image import ImageWriter

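# input options
password = ''         # password for an encrypted PDF; empty if none
pagenos = set()       # page indexes to process; an empty set means all pages
maxpages = 0          # maximum number of pages; 0 means no limit
# output options
outfile = None
outtype = None
imagewriter = None    # no image extraction
rotation = 0          # extra rotation applied to every page, in degrees
layoutmode = 'normal'
codec = 'euc-kr'      # output encoding; set to 'utf-8' if you prefer UTF-8 output
pageno = 1
scale = 1             # scale factor for the HTML output
caching = True
showpageno = True
laparams = LAParams()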

fpname = askopenfilename()
fp = open(fpname, 'rb')

outfpname = 'pdf2output'

rsrcmgr = PDFResourceManager(caching=caching)
outfp = open(outfpname + '.txt', 'w')
device = TextConverter(rsrcmgr, outfp, codec=codec, 
                       laparams=laparams, imagewriter=imagewriter)

interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(fp, pagenos,
                              maxpages=maxpages, password=password,
                              caching=caching, check_extractable=True):
    page.rotate = (page.rotate+rotation) % 360
    interpreter.process_page(page)
    
device.close()
outfp.close()

rsrcmgr = PDFResourceManager(caching=caching)
outfp = open(outfpname + '.html', 'w')
device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale,
                       layoutmode=layoutmode, laparams=laparams,
                       imagewriter=imagewriter)

interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(fp, pagenos,
                              maxpages=maxpages, password=password,
                              caching=caching, check_extractable=True):
    page.rotate = (page.rotate+rotation) % 360
    interpreter.process_page(page)

# closing the device lets HTMLConverter write the closing part of the HTML
device.close()
outfp.close()

fp.close()
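The pagenos and maxpages values passed to PDFPage.get_pages in the code above decide which pages get processed. The sketch below imitates that filtering on a plain list, so it runs without PDFMiner. It reflects my reading of the behavior (zero-based page indexes, an empty pagenos meaning every page, 0 meaning no page limit) and is not the library's own code. The name select_pages is hypothetical.

```python
from __future__ import print_function  # keeps print() consistent on Python 2


def select_pages(pages, pagenos=frozenset(), maxpages=0):
    """Yield the pages that pass the pagenos/maxpages filters (illustrative only)."""
    for pageno, page in enumerate(pages):
        if pagenos and pageno not in pagenos:
            continue  # a non-empty pagenos keeps only the listed zero-based indexes
        yield page
        if maxpages and maxpages <= pageno + 1:
            break  # a non-zero maxpages stops once that page position is reached


pages = ['p1', 'p2', 'p3', 'p4']
print(list(select_pages(pages)))                  # every page
print(list(select_pages(pages, pagenos={0, 2})))  # first and third page
print(list(select_pages(pages, maxpages=2)))      # first two pages
```

With the settings used in the code above (pagenos = set() and maxpages = 0), every page of the selected PDF is converted.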
