PDF Files Handling in Python

PDF is a widely used format for digital documents, and Python is a general-purpose language with mature libraries for working with it. Together they let you read, generate, edit, and convert PDF files from a script instead of by hand. This article walks through the common tasks: extracting text, creating new documents, rotating and merging pages, removing watermarks, and converting HTML to PDF.

Python PDF Libraries

Several libraries are available for working with PDF files in Python. Popular choices include PyPDF2 for reading and editing existing files, and reportlab and fpdf for generating new ones.
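Before trying the examples, it can help to check which of these libraries are installed. The sketch below uses only the standard library; the helper name `installed_pdf_libraries` is hypothetical, and the import names are assumed to match the library names given above.

```python
from importlib.util import find_spec

# Import names of the libraries mentioned above (assumed to match
# the names used in their import statements).
PDF_LIBRARIES = ("PyPDF2", "reportlab", "fpdf")


def installed_pdf_libraries(names=PDF_LIBRARIES):
    """Return a dict mapping each library name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in installed_pdf_libraries().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

Any library reported as missing can be installed with pip before you continue.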

Reading PDF with Python

To read a PDF file, you can use the PyPDF2 library. Here's an example:

import PyPDF2

# Open the PDF file; the with block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    # (PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # Loop through all the pages and extract the text
    for page in range(num_pages):
        page_obj = pdf_reader.pages[page]
        print(page_obj.extract_text())

Generating PDF with Python

To generate new PDF files from scratch, you can use the reportlab or fpdf library. Here's an example using reportlab:

from reportlab.pdfgen import canvas

# Create a new PDF file
pdf_file = canvas.Canvas('example.pdf')

# Add text to the PDF
pdf_file.drawString(100, 750, "Hello World")

# Save and close the PDF file
pdf_file.save()

Similarly, you can use the fpdf library to create a PDF.

Editing PDF with Python

To edit existing PDF files, you can use the PyPDF2 library. Here's an example that rotates the pages in a PDF file:

import PyPDF2

# Open the PDF file; the with block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Create a PDF writer object
    pdf_writer = PyPDF2.PdfWriter()

    # Rotate each page 90 degrees clockwise and add it to the PDF writer
    for page_obj in pdf_reader.pages:
        page_obj.rotate(90)
        pdf_writer.add_page(page_obj)

    # Save the rotated PDF file while the source file is still open
    with open('example_rotated.pdf', 'wb') as pdf_output:
        pdf_writer.write(pdf_output)

In summary, Python provides multiple libraries to work with PDF files, enabling you to read, generate, and edit PDFs programmatically.

How to Extract Text from PDF with Python

To extract text from a PDF with Python, you can use PyPDF2 or pdfminer (maintained today as pdfminer.six). Both libraries parse the PDF and return its text content.

Example 1: Using PyPDF2

import PyPDF2

text = ''
with open('file.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf_reader.pages:
        # Pages without a text layer (for example scans) yield an empty string
        text += page.extract_text()

print(text)

Example 2: Using pdfminer

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def pdf_to_text(pdf_path):
    manager = PDFResourceManager()
    output = StringIO()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # Feed every page through the interpreter; the text accumulates in output
    with open(pdf_path, 'rb') as file:
        for page in PDFPage.get_pages(file, check_extractable=True):
            interpreter.process_page(page)

    text = output.getvalue()

    # Release the converter and the buffer once the text has been copied out
    converter.close()
    output.close()
    return text

Both of these methods will allow you to extract text content from a PDF with Python.

How to Combine PDF Pages

Merging several PDF files into a single document is a common task in document processing. With PyPDF2 you can either copy individual pages from different files or append whole files.

Merge Two PDF Pages Using PyPDF2

import PyPDF2

# Open both PDF files; the with block closes them automatically
with open('file1.pdf', 'rb') as f1, open('file2.pdf', 'rb') as f2:
    pdf1 = PyPDF2.PdfReader(f1)
    pdf2 = PyPDF2.PdfReader(f2)

    # Copy the first page of each file into a new document
    output = PyPDF2.PdfWriter()
    output.add_page(pdf1.pages[0])
    output.add_page(pdf2.pages[0])

    # Save the merged PDF file
    with open('merged.pdf', 'wb') as f:
        output.write(f)

Merge Entire PDF Files Using PyPDF2

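from PyPDF2 import PdfMerger

pdfs = ['file1.pdf', 'file2.pdf']

# PdfMerger replaces PdfFileMerger, which was removed in PyPDF2 3.0
merger = PdfMerger()
for pdf in pdfs:
    # append() accepts a file path and adds every page of that file
    merger.append(pdf)

with open('merged_pdf.pdf', 'wb') as f:
    merger.write(f)

# Release the input files held open by the merger
merger.close()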

Using the examples above, you can merge selected pages or entire files with PyPDF2, producing a single document that is easier to manage and distribute.

How to Remove Watermark from PDF

How a watermark can be removed depends on how it was added to the file: as an annotation, as an image, or as part of the page content. Here are two approaches using the PyPDF2 and PyMuPDF libraries.

# Solution 1: strip images with PyPDF2
# Assumption: the watermark is an image. remove_images() strips ALL images
# (and may drop other graphics), so use it only when they are expendable.
import PyPDF2

# Open the PDF file
with open('filename.pdf', 'rb') as pdf:
    # Create a PdfReader object
    pdf_reader = PyPDF2.PdfReader(pdf)

    # Create a PdfWriter object
    pdf_writer = PyPDF2.PdfWriter()

    # Add every page to the PdfWriter object
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

    # Remove the image content, including an image watermark
    pdf_writer.remove_images()

    # Save the PDF with the watermark removed
    with open('filename_nw.pdf', 'wb') as f:
        pdf_writer.write(f)

# Solution 2: delete watermark annotations with PyMuPDF
import fitz  # PyMuPDF

# Open the PDF file
pdf = fitz.open('filename.pdf')

# Iterate over the pages in the PDF file
for page in pdf:
    # Iterate through the annotations on the page
    for annotation in page.annots():
        # Check if the annotation is a watermark
        if annotation.type[0] == fitz.PDF_ANNOT_WATERMARK:
            # Remove the annotation
            page.delete_annot(annotation)

# Save the PDF with the watermark removed
pdf.save('filename_nw.pdf')
pdf.close()

These approaches cover common cases, but a watermark that is drawn directly into the page content may need a different technique.

How to Convert HTML to PDF

Converting HTML to PDF is a common task in web development, and several Python libraries handle it in a few lines. Here are two examples using popular libraries:

Using the pdfkit library

import pdfkit

# pdfkit wraps the wkhtmltopdf command-line tool, which must be installed separately
pdfkit.from_file('path/to/file.html', 'path/to/output.pdf')

Using the weasyprint library

from weasyprint import HTML

HTML('path/to/file.html').write_pdf('path/to/output.pdf')

Both libraries convert HTML to PDF with just a few lines of code, which makes them easy to incorporate into any Python project. Install the libraries with pip before running the examples; pdfkit additionally requires the wkhtmltopdf tool to be installed on the system.
