반응형

https://pandas.pydata.org/pandas-docs/stable/index.html

 

pandas documentation — pandas 2.2.1 documentation

API reference The reference guide contains a detailed description of the pandas API. The reference describes how the methods work and which parameters can be used. It assumes that you have an understanding of the key concepts.

pandas.pydata.org

 

Community tutorials — pandas 2.2.1 documentation

 

pandas.pydata.org

 

반응형
반응형

[python] PyMuPDF 에서 해상도 올리기. PDF to IMG

How to Increase Image Resolution

 

https://pymupdf.readthedocs.io/en/latest/recipes-images.html

 

Images - PyMuPDF 1.23.25 documentation

Previous Text

pymupdf.readthedocs.io

The image of a document page is represented by a Pixmap, and the simplest way to create a pixmap is via method Page.get_pixmap().

This method has many options to influence the result. The most important among them is the Matrix, which lets you zoom, rotate, distort or mirror the outcome.

Page.get_pixmap() by default will use the Identity matrix, which does nothing.

In the following, we apply a zoom factor of 2 to each dimension, which will generate an image with a four times better resolution for us (and also about 4 times the size):

zoom_x = 2.0  # horizontal zoom
zoom_y = 2.0  # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)  # zoom factor 2 in each dimension
pix = page.get_pixmap(matrix=mat)  # use 'mat' instead of the identity matrix

dpi = 600
pix = page.get_pixmap(dpi)

Since version 1.19.2 there is a more direct way to set the resolution: Parameter "dpi" (dots per inch) can be used in place of "matrix". To create a 300 dpi image of a page specify pix = page.get_pixmap(dpi=300). Apart from notation brevity, this approach has the additional advantage that the dpi value is saved with the image file – which does not happen automatically when using the Matrix notation.

 
반응형
반응형

djangorestframework 3.14.0

 

 

https://www.django-rest-framework.org/

 


*** 참고 : https://towardsdatascience.com/designing-and-deploying-a-machine-learning-python-application-part-2-99eb37787b2b

https://pypi.org/project/djangorestframework/

pip install djangorestframework

Requirements Python 3.6+ Django 4.1, 4.0, 3.2, 3.1, 3.0 We highly recommend and only officially support the latest patch release of each Python and Django series.

Installation Install using pip...

pip install djangorestframework
Add 'rest_framework' to your INSTALLED_APPS setting.

INSTALLED_APPS = [
    ...
    'rest_framework',
]

 

Designing and Deploying a Machine Learning Python Application (Part 2)

You don’t have to be Atlas to get your model into the cloud

towardsdatascience.com

 

 

 

 

 

반응형
반응형

https://wikidocs.net/book/2070

 

Python 계단밟기

## Python을 배워보자 > 문제를 풀다보면 문법은 자동으로 알게된다. > 문법을 배우는것이 프로그램을 배우는것이 아니라 어떻게 활용하는냐가 중요하다

wikidocs.net

 

반응형
반응형

Matplotlib Tutorial - 파이썬으로 데이터 시각화하기

 

https://wikidocs.net/book/5011

 

Matplotlib Tutorial - 파이썬으로 데이터 시각화하기

## 도서 소개 - 이 책은 파이썬의 대표적인 데이터 시각화 라이브러리인 Matplotlib의 사용법을 소개합니다. - 30여 개 이상의 다양한 주제에 대해 100개…

wikidocs.net

Matplotlib 데이터 시각화와 2D 그래프 플롯에 사용되는 파이썬 라이브러리입니다.

Matplotlib을 이용하면 아래 그림과 같이 다양한 유형의 그래프를 간편하게 그릴 수 있습니다.

 

 

Matplotlib의 간단한 사용법을 소개하고, 예제와 함께 다양한 그래프를 그려봅니다.

예제들은 Matplotlib 공식 홈페이지를 참고해서 만들었습니다.

순서는 아래와 같습니다.

 

Contents

00. Matplotlib 설치하기
01. Matplotlib 기본 사용
02. Matplotlib 숫자 입력하기
03. Matplotlib 축 레이블 설정하기
04. Matplotlib 범례 표시하기
05. Matplotlib 축 범위 지정하기
06. Matplotlib 선 종류 지정하기
07. Matplotlib 마커 지정하기
08. Matplotlib 색상 지정하기
09. Matplotlib 그래프 영역 채우기
10. Matplotlib 축 스케일 지정하기
11. Matplotlib 여러 곡선 그리기
12. Matplotlib 그리드 설정하기
13. Matplotlib 눈금 표시하기
14. Matplotlib 타이틀 설정하기
15. Matplotlib 수평선/수직선 표시하기
16. Matplotlib 막대 그래프 그리기
17. Matplotlib 수평 막대 그래프 그리기
18. Matplotlib 산점도 그리기
19. Matplotlib 3차원 산점도 그리기
20. Matplotlib 히스토그램 그리기
21. Matplotlib 에러바 그리기
22. Matplotlib 파이 차트 그리기
23. Matplotlib 히트맵 그리기
24. Matplotlib 여러 개의 그래프 그리기
25. Matplotlib 컬러맵 설정하기
26. Matplotlib 텍스트 삽입하기
27. Matplotlib 수학적 표현 사용하기
28. Matplotlib 그래프 스타일 설정하기
29. Matplotlib 이미지 저장하기
30. Matplotlib 객체 지향 인터페이스 1
31. Matplotlib 객체 지향 인터페이스 2
32. Matplotlib 축 위치 조절하기
33. Matplotlib 이중 Y축 표시하기
34. Matplotlib 두 종류의 그래프 그리기
35. Matplotlib 박스 플롯 그리기
36. Matplotlib 바이올린 플롯 그리기
37. Matplotlib 다양한 도형 삽입하기
38. Matplotlib 다양한 패턴 채우기

 

반응형
반응형

 

The Go programming language enters the top 10

 

TIOBE Index for February 2024

 

https://www.tiobe.com/tiobe-index/

 

TIOBE Index - TIOBE

Home » TIOBE Index TIOBE Index for February 2024 February Headline: The Go programming language enters the top 10 This month, Go entered the TIOBE index top 10 at position 8. This is the highest position Go has ever had. When it was launched by Google in

www.tiobe.com

 

반응형
반응형

 

지정 폴더안의 이미지 전부  텍스트 추출하기

# 파이썬 컴파일 경로가 달라서 현재 폴더의 이미지를 호출하지 못할때 작업디렉토리를 변경한다. 
import os
from pathlib import Path
# src 상위 폴더를 실행폴더로 지정하려고 한다.
###real_path = Path(__file__).parent.parent
real_path = Path(__file__).parent
print(real_path)
#작업 디렉토리 변경
os.chdir(real_path) 

"""_summary_
pip install pillow
pip install pytesseract



다운 받아야하는 학습된 한글 데이터 파일명: kor.traineddata
파일 위치: tesseract가 설치된 경로 C:\Program Files\Tesseract-OCR\tessdata

"""



from PIL import Image
import pytesseract  
import cv2 
import matplotlib.pyplot as plt

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
config = ('-l kor+eng --oem 3 --psm 11')
#config = ('-l kor+eng')
directory_base = str(real_path)+"./img/"  # 경로object를 문자열로 변경해서 합친다. 

        
# Open an image file
image_path = directory_base+"03_kor_eng.png"  # Replace with your image file path
img = Image.open(image_path)

# Use Tesseract to extract text
text = pytesseract.image_to_string(img, config=config)

print("Extracted Text:" + text)

image = cv2.imread(image_path)
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

plt.imshow(rgb_image)

# use Tesseract to OCR the image 
# text = pytesseract.image_to_string(rgb_image, lang='kor+eng')
text = pytesseract.image_to_string(rgb_image, config=config)
print(text)


if __name__ == "__main__":
     
    # List all files in the directory
    file_list = [f for f in os.listdir(directory_base) if os.path.isfile(os.path.join(directory_base, f))]

    # Print the list of files
    for file in file_list:
        print(file)
        # Open an image file
        image_path = directory_base + file  # Replace with your image file path
        img = Image.open(image_path)

        text = pytesseract.image_to_string(img, config=config)
        print("Extracted Text:")
        print(text)

[python] 이미지에서 텍스트 추출하기,  tesseract, OCR

 

 

반응형
반응형

pip install pytesseract

 

한글팩 : https://github.com/tesseract-ocr/tessdata/

 
다운 받아야하는 학습된 한글 데이터 파일명: kor.traineddata
파일 위치: tesseract가 설치된 경로 C:\Program Files\Tesseract-OCR\tessdata
 

*** 설치 할때 언어팩 선택 

 

pytesseract 0.3.10

 

Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images.

Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

USAGE

Quickstart

Note: Test images are located in the tests/data folder of the Git repo.

Library usage:

from PIL import Image

import pytesseract

# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))

# In order to bypass the image conversions of pytesseract, just use relative or absolute image path
# NOTE: In this case you should provide tesseract supported images or tesseract will return error
print(pytesseract.image_to_string('test.png'))

# List of available languages
print(pytesseract.get_languages(config=''))

# French text image to string
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

# Batch processing with a single file containing the list of multiple image file paths
print(pytesseract.image_to_string('images.txt'))

# Timeout/terminate the tesseract job after a period of time
try:
    print(pytesseract.image_to_string('test.jpg', timeout=2)) # Timeout after 2 seconds
    print(pytesseract.image_to_string('test.jpg', timeout=0.5)) # Timeout after half a second
except RuntimeError as timeout_error:
    # Tesseract processing is terminated
    pass

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open('test.png')))

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get ALTO XML output
xml = pytesseract.image_to_alto_xml('test.png')

Support for OpenCV image/NumPy array objects

import cv2

img_cv = cv2.imread(r'/<path_to_image>/digits.png')

# By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,
# we need to convert from BGR to RGB format/mode:
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img_rgb))
# OR
img_rgb = Image.frombytes('RGB', img_cv.shape[:2], img_cv, 'raw', 'BGR', 0, 0)
print(pytesseract.image_to_string(img_rgb))

If you need custom configuration like oem/psm, use the config keyword.

# Example of adding any additional options
custom_oem_psm_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config)

# Example of using pre-defined tesseract config file with options
cfg_filename = 'words'
pytesseract.run_and_get_output(image, extension='txt', config=cfg_filename)

Add the following config, if you have tessdata error like: “Error opening data file…”

# Example config: r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
# It's important to add double quotes around the dir path.
tessdata_dir_config = r'--tessdata-dir "<replace_with_your_tessdata_dir_path>"'
pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)

Functions

  • get_languages Returns all currently supported languages by Tesseract OCR.
  • get_tesseract_version Returns the Tesseract version installed in the system.
  • image_to_string Returns unmodified output as string from Tesseract OCR processing
  • image_to_boxes Returns result containing recognized characters and their box boundaries
  • image_to_data Returns result containing box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, please check the Tesseract TSV documentation
  • image_to_osd Returns result containing information about orientation and script detection.
  • image_to_alto_xml Returns result in the form of Tesseract’s ALTO XML format.
  • run_and_get_output Returns the raw output from Tesseract OCR. Gives a bit more control over the parameters that are sent to tesseract.

Parameters

image_to_data(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0, pandas_config=None)

  • image Object or String - PIL Image/NumPy array or file path of the image to be processed by Tesseract. If you pass object instead of file path, pytesseract will implicitly convert the image to RGB mode.
  • lang String - Tesseract language code string. Defaults to eng if not specified! Example for multiple languages: lang='eng+fra'
  • config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
  • nice Integer - modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.
  • output_type Class attribute - specifies the type of the output, defaults to string. For the full list of all supported types, please check the definition of pytesseract.Output class.
  • timeout Integer or Float - duration in seconds for the OCR processing, after which, pytesseract will terminate and raise RuntimeError.
  • pandas_config Dict - only for the Output.DATAFRAME type. Dictionary with custom arguments for pandas.read_csv. Allows you to customize the output of image_to_data.

CLI usage:

pytesseract [-l lang] image_file

INSTALLATION

Prerequisites:

  • Python-tesseract requires Python 3.6+
  • You will need the Python Imaging Library (PIL) (or the Pillow fork). Under Debian/Ubuntu, this is the package python-imaging or python3-imaging.
  • Install Google Tesseract OCR (additional info how to install the engine on Linux, Mac OSX and Windows). You must be able to invoke the tesseract command as tesseract. If this isn’t the case, for example because tesseract isn’t in your PATH, you will have to change the “tesseract_cmd” variable pytesseract.pytesseract.tesseract_cmd. Under Debian/Ubuntu you can use the package tesseract-ocr. For Mac OS users. please install homebrew package tesseract.
  • Note: In some rare cases, you might need to additionally install tessconfigs and configs from tesseract-ocr/tessconfigs if the OS specific package doesn’t include them.
Installing via pip:

Check the pytesseract package page for more information.

pip install pytesseract
Or if you have git installed:
pip install -U git+https://github.com/madmaze/pytesseract.git
Installing from source:
git clone https://github.com/madmaze/pytesseract.git
cd pytesseract && pip install -U .
Install with conda (via conda-forge):
conda install -c conda-forge pytesseract

TESTING

To run this project’s test suite, install and run tox. Ensure that you have tesseract installed and in your PATH.

pip install tox
tox

반응형

+ Recent posts