### Top 6 Python Tips for Data Scientists
### https://towardsdatascience.com/top-6-python-tips-for-data-scientists-4f4a25e44d15
codeString = '''a,b = 4,5; print(f"a = {a} and b = {b}"); print(f"a+b = {a+b}")'''
exec(codeString)
print("\n\n\n")
import os
import sys
# Check the current working directory
print(os.getcwd())
exec(open("./Project/Datascientist/myFullFileName.py").read())
print("\n\n\n")
### List all files in the current working directory or in a user-specified directory.
print( os.listdir() )
print("\n\n\n")
### 4. Code timer as a decorator
import time
import requests
def timerWrapper(func):
    """Code the timer"""
    def timer(*args, **kwargs):
        """Start timer"""
        start = time.perf_counter()
        output = func(*args, **kwargs)
        timeElapsed = time.perf_counter() - start
        print(f"Current function: {func.__name__}\n Run Time: {timeElapsed}")
        return output
    return timer
## Function that makes a request to a user-defined URL
@timerWrapper
def getArtile(url):
    return requests.get(url, allow_redirects=True)
## Monitor the run time
if __name__ == "__main__":
    getArtile('https://towardsdatascience.com/6-sql-tricks-every-data-scientist-should-know-f84be499aea5')
print("\n\n\n")
## Now, to time another function, all we need to do is put @timerWrapper in front of it.
@timerWrapper
def getMultiplication(num):
    for val in range(num):
        print(10**(10**val))
getMultiplication(3)
Scenario: You inherited a Python project from a colleague and immediately noticed that the scripts each run to a whopping 5000+ lines of code. The same chunks of code were copied and pasted multiple times! So, is there a more efficient way to reuse code?
Let’s explore the exec() function in Python. Simply put, it takes a string or a code object and executes it, as shown in this example,
a = 4 and b = 5
a+b = 9
Even more handy? We can use exec(open().read()) to call and execute a file within the Python interpreter. For example,
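A minimal sketch of this pattern, assuming a small throwaway script stands in for the reusable file. The file name `reusable_snippet.py` and its contents are made up for illustration. Passing an explicit dictionary as the namespace keeps the executed names out of the caller's globals, and `exec()` should only ever be pointed at files you trust.

```python
import os
import tempfile

# Hypothetical reusable script, written to a temp directory for the demo.
script_body = "a, b = 4, 5\ntotal = a + b\nprint(f'a+b = {total}')\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "reusable_snippet.py")
    with open(script_path, "w", encoding="utf-8") as fh:
        fh.write(script_body)

    # Read the file and execute it; the names it defines land in `namespace`.
    namespace = {}
    with open(script_path, encoding="utf-8") as fh:
        exec(fh.read(), namespace)

print(namespace["total"])
```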
With this powerful one-liner, data scientists can save programs that will be reused as standalone files, and execute them whenever needed within the main program. No code copying and pasting any more!
A cool piece of functionality in Python, exec() has one pitfall to avoid: it does NOT return any value,
a = 4 and b = 5
a+b = 9
** Is the return from exec() is None? True **
As we can see, the output of the exec() function is None; hence, its return value cannot be used to store anything, which is equivalent to the source() function in R.
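The pitfall can be checked directly. The sketch below is illustrative only (the code string and variable names are assumptions, not the original article's snippet) and also shows one common workaround: hand `exec()` a dictionary and read the values back from it.

```python
code_string = "a, b = 4, 5\nresult = a + b"

# exec() runs statements and always returns None.
returned = exec(code_string)
print(returned is None)  # True

# To keep the values, pass a dictionary that serves as the namespace.
scope = {}
exec(code_string, scope)
print(scope["result"])  # 9

# For a single expression, eval() does return the value.
print(eval("4 + 5"))  # 9
```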
2. Operating system commands with {os} and {shutil}
Scenario: continuing from our previous tip, now you want to check out the script before executing it. Don’t want to click all the way through your folders to open the file? No problem, you can do this directly in Python without interrupting your train of thought.
Here, the os.startfile() function (available on Windows only) allows users to open any type of file, including MS Office documents, Excel workbooks, and R and SQL scripts.
Similarly, we can also delete a FILE using os.remove("myFullFileName.ANYFORMAT")
or delete the entire DIRECTORY using shutil.rmtree("folderToBeRemoved"), where {shutil} is a Python module that offers a number of high-level file operations, particularly for file copying and removal.
Therefore, if you haven’t used {os} for anything other than os.getcwd() or os.chdir(), or if {shutil} sounds unfamiliar, it’s time to check out their documentation. You will definitely find useful commands or file system methods that make your coding easier. Here are a few of my favorites,
os.listdir() or os.listdir("someDirectory"): list all files in the current working directory or any user-specified directory, respectively;
os.path.join(): automatically build a path from the elements in its arguments for later use, e.g., on Windows os.path.join('D', 'Medium', 'New Folder') will return 'D\\Medium\\New Folder';
os.makedirs(): create a directory, including any missing parent directories;
shutil.copy2("sourcePath", "destinationPath") or shutil.move("sourcePath", "destinationPath"): copy or cut a file, respectively.
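The commands listed above can be tried safely inside a temporary directory. This is a sketch under stated assumptions: the folder and file names (`Medium`, `New Folder`, `notes.txt`, and so on) are placeholders, and nothing outside the temp directory is touched.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    # Build a nested path and create every missing directory along it.
    nested_dir = os.path.join(base_dir, "Medium", "New Folder")
    os.makedirs(nested_dir, exist_ok=True)

    # Create a small file, then copy it (copy2 also tries to preserve metadata).
    source_path = os.path.join(nested_dir, "notes.txt")
    with open(source_path, "w", encoding="utf-8") as fh:
        fh.write("hello")
    copy_path = shutil.copy2(source_path, os.path.join(base_dir, "notes_copy.txt"))

    # Move (cut and paste) the copy under a new name.
    moved_path = shutil.move(copy_path, os.path.join(base_dir, "notes_moved.txt"))
    listing_before = sorted(os.listdir(base_dir))

    # Delete a single file, then a whole directory tree.
    os.remove(moved_path)
    shutil.rmtree(os.path.join(base_dir, "Medium"))
    listing_after = os.listdir(base_dir)

print(listing_before)  # ['Medium', 'notes_moved.txt']
print(listing_after)   # []
```

Note that os.remove() and shutil.rmtree() delete permanently rather than moving items to a recycle bin, so it is worth double-checking the path first.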
3. One-liner: Nested list comprehension to get rid of the for loops
Scenario: the “simple” task we come across is to combine several lists into one big long list,
Surely, we can write five nested for loops to append each sublist to the final output list. But it’s smarter to turn to a nested list comprehension, which is the most concise way,
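As an illustration, here is a small sketch with made-up input lists (the original article's data is not reproduced here). The `for` clauses of the comprehension appear in the same order as the equivalent nested loops.

```python
# Hypothetical input: a list of lists.
nested = [[1, 2, 3], [4, 5], [6], [7, 8, 9]]

# Read the clauses left to right like nested for loops:
# "for sublist in nested: for item in sublist: keep item".
flat = [item for sublist in nested for item in sublist]
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# One more level of nesting just adds one more `for` clause.
deeper = [[[1, 2], [3]], [[4], [5, 6]]]
flat_deeper = [item for block in deeper for sublist in block for item in sublist]
print(flat_deeper)  # [1, 2, 3, 4, 5, 6]
```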
4. Code timer as a decorator
Scenario: while Python is recognized as one of the most effective programming languages, data scientists still need to check the runtime of their programs.
It’s not hard to implement a bare-bones Python timer for each function we want to monitor. However, if we code the timer as a decorator, it becomes much easier to version-control and reuse!
Here is how,
In this snippet,
the timer is wrapped in a timerWrapper function, which is then used as a decorator placed before the main function;
the example main function returns the response from a request to a URL, which is one of my previous blog posts.
Running this code gives us the time elapsed,
Current function: getArtile
Run Time: 1.6542516000008618
Out[101]: <Response [200]>
Now, to time another function, all we need to do is put @timerWrapper in front of the function,
getMultiplication(3)
10
10000000000
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Current function: getMultiplication
Run Time: 0.00014700000065204222
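One optional refinement, offered here as a sketch rather than as part of the original snippet: applying `functools.wraps` inside the decorator keeps the decorated function's own `__name__` and docstring, which helps when several timed functions show up in logs or in `help()`. The names `timer_wrapper` and `sum_of_squares` are hypothetical, and the workload avoids any network call.

```python
import functools
import time

def timer_wrapper(func):
    """Print the run time of each call to the decorated function."""
    @functools.wraps(func)  # copy __name__ and __doc__ onto the wrapper
    def timer(*args, **kwargs):
        start = time.perf_counter()
        output = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Current function: {func.__name__}\n Run Time: {elapsed:.6f} s")
        return output
    return timer

@timer_wrapper
def sum_of_squares(n):
    """Hypothetical workload: sum the squares of 0..n-1."""
    return sum(i * i for i in range(n))

result = sum_of_squares(1000)
print(result)                   # 332833500
print(sum_of_squares.__name__)  # sum_of_squares
```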
5. Leverage the options system to customize your display
Scenario: as data scientists, we analyze data with {pandas} and {numpy} on a daily basis. When I first learned coding in Python, I was frustrated seeing this after reading my data into the IDE,
Clearly, data display is cut off both row-wise and column-wise, and the following code can fix it,
Here, we are explicitly setting the maximum number of columns and rows and the maximum column width to display/print in the console. There are numerous customizable options and settings in {pandas}, and similar operations are also available in {numpy} for arrays and matrices,
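A minimal sketch of such settings, assuming {pandas} 1.0 or later and {numpy} are installed. The specific values (no limit for pandas, a line width of 200 for numpy) and the small demo frame are illustrative choices, not the article's originals.

```python
import sys

import numpy as np
import pandas as pd

# pandas: None means "no limit" for these display options.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

# numpy: print arrays in full instead of summarizing them with "...".
np.set_printoptions(threshold=sys.maxsize, linewidth=200)

# Hypothetical wide frame to check the effect.
df = pd.DataFrame(np.arange(60).reshape(2, 30))
print(df)
```

Calling pd.reset_option("display.max_columns") (and likewise for the other options) restores the defaults once the full display is no longer needed.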
6. Reproduce your machine learning model results? Set seeds!
Scenario: Because many machine learning procedures are stochastic, we have all run into non-deterministic results. This randomness makes it difficult to reproduce the same results across different runs. Consequently, it’s challenging to figure out whether an improvement in performance metrics comes from successful model tuning or simply from a different random training/testing sample.
Luckily, reproducibility can be achieved by setting the random seed throughout your model pipeline, provided that you do it correctly! How many times have you seen questions like “Getting a different result despite random seed defined” popping up on Stack Overflow? Well, how to appropriately set seeds should be on the first page of the documentation, but it took me some time to dig it out.
I found that NOT every seed is defined the same way in {numpy}, {sklearn}, {tensorflow} or {torch}. Thus, it’s best practice to use one dedicated function that sets the seeds for all your frameworks. For example,
Adding this tactical reset_random_seed() function to all necessary steps of your workflow, such as the train-test split, model compilation/training, and interpretation, will get you halfway to full reproducibility. More detailed visibility into your experiments will take care of the second half!
import os
import sys
print(" os environment variables (environ) \n")
print(" * All system environment variables ")
print(" os.environ : ", os.environ)
print('')
for key, value in os.environ.items():
    print('{}: {}'.format(key, value))
print(" * A specific system environment variable ")
# .get() avoids a KeyError when JAVA_HOME is not set on this machine
print(" os.environ.get('JAVA_HOME') : ", os.environ.get('JAVA_HOME'))
# Fetching a non-existent variable with .get() raises no error and returns None
result = os.environ.get('NOT_EXISTS')
print(result)
print(" * Path of the directory the code is currently running in ")
print(" os.getcwd() : ", os.getcwd())
print(" * PID of the currently running Python script ")
print(" os.getpid() : ", os.getpid() )
print(" * Change the current working directory ")
print(" os.chdir() : " )
print(" import os ")
print(" print(os.getcwd()) #/Users/projects/workspace ")
print(" os.chdir( os.getcwd()+'/scripts/src' ) ")
print(" print(os.getcwd()) #/Users/projects/workspace/scripts/src ")
print(" * Create a directory : mkdir(path[,mode]) ")
#_result = os.mkdir("test_dir")
#print(_result)
print(" * Return the list of files and directories that exist in path ")
_result = os.listdir(".")
print(" os.listdir('.') : ", _result )
print('')
for x in _result:
    print('{}'.format(x))
# Recursively create the directory passed as the argument
# - raises an exception if the **directory already exists** or **cannot be created for lack of permission**
print(" * Create directories : makedirs(path[,mode]) raises an exception if the **directory already exists** or **cannot be created for lack of permission**")
#_result = os.makedirs("test_dir")
print(" * Traverse subfolders with a for loop : os.walk(path) ", " top-down by default; pass topdown=False for bottom-up ")
if __name__ == "__main__":
    root_dir = "./"
    for (root, dirs, files) in os.walk(root_dir):
        print("root : " + root)
        if len(dirs) > 0:
            for dir_name in dirs:
                print("dir: " + dir_name)
        if len(files) > 0:
            for file_name in files:
                print("file: " + file_name)
os.walk(root_dir, topdown=False) # use this call in the loop above for bottom-up traversal
print(" * Returns the string 'A'+'/'+'B' (using the OS path separator) : os.path.join('A','B') ")
print(" os.path.isdir(): is it a directory? ")
print(os.getcwd(), " ==> " , os.path.isdir(os.getcwd()) )
print(" os.path.abspath(path): returns the absolute path ")
print(" Absolute path of the current ./sketchpy_001.py file : ", os.path.abspath("./sketchpy_001.py"))
print(" os.path.dirname(path) : returns the path without its last component ")
print(" os.path.exists(path) : returns True if a file or directory exists at path, otherwise False ")
print("os.path.isfile(path) ")
print("os.path.isdir(path) ")
print("os.path.isabs() ")
print("")
import glob
print( " glob.glob(os.getcwd()) : works much like ls; shell-style wildcards are supported (* ? [0-9]), not full regular expressions " )
print(" (workspace) $ ls = glob.glob(os.getcwd()+'/*') ")
print(" returns a list ")
print( glob.glob(os.getcwd()))
print( glob.glob(os.getcwd() + "/*"))
print( "" )
print(" glob.iglob(path) ")
print(" - unlike glob.glob, returns an iterator ")
print(" - results are not stored in a list, so it is useful when there are very many matches ")
for i in glob.iglob(os.getcwd()+'/*'):
    print(i)
"""
Encoding information
"""
import os
import sys
# Check the current working directory
#print(os.getcwd())
prePath_in = "./Project/"
prePath_out = "./Project/"
#1. Writing basic content
# Write new plain text
file = open("test.txt", "w")
file.write("내용입력")
file.close()
# Prevent garbled Korean text: ENCODING UTF-8
file = open("test.txt", "w", encoding="UTF-8")
file.write("내용입력")
file.close()
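# Prevent garbled Korean text (2): ENCODING UTF-8 with BOM
# UTF-8 is enough for txt, but a csv written as plain UTF-8 can appear garbled when the reader assumes a different encoding; UTF-8-sig adds a BOM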
file = open("test.csv", "w", encoding="UTF-8-sig")
file.write("test,test,test\n")
file.write("잘되나,안된다,오된다\n")
file.close()
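# Note
# If you get Permission denied: 'test.csv',
# the file is open in another program and cannot be modified. Close it first.
# Append
file = open("test.txt", "a", encoding="UTF-8")  # same encoding as the earlier write, so the file stays readable as UTF-8
file.write("추가 내용입력")
file.close()
# Read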
file = open("test.txt", "r", encoding="UTF-8")
print(file.read())
file.close()
# with statement : handles both open and close
with open("test.txt", "w", encoding="UTF-8") as file:
    file.write("내용입력")
with open("test.txt", "r", encoding="UTF-8") as file:
    print(file.read())
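#2. Send print output to a txt file
# Save a log by reassigning sys.stdout
f = open('test.txt','w', encoding='utf-8') # open the file that will hold the log
sys.stdout = f
print("내용입력")
sys.stdout = sys.__stdout__ # restore the original stdout
f.close() # close the log file
# This also works, but the file is never closed explicitly, which might cause problems if the program does not exit cleanly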
sys.stdout = open('test.txt','w', encoding='utf-8')
print("내용입력")
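A possible alternative to reassigning sys.stdout by hand is `contextlib.redirect_stdout` from the standard library, which restores the original stream and closes the file even if an exception occurs. This is only a sketch: the log file name `redirect_demo.txt` and its location in the system temp directory are assumptions made for the demo.

```python
import contextlib
import os
import tempfile

# Hypothetical log file in the system temp directory.
log_path = os.path.join(tempfile.gettempdir(), "redirect_demo.txt")

with open(log_path, "w", encoding="utf-8") as log_file:
    with contextlib.redirect_stdout(log_file):
        print("내용입력")  # written to the file, not the console

print("back on the console")  # stdout is restored automatically

# Read the log back to confirm what was captured, then clean up.
with open(log_path, encoding="utf-8") as log_file:
    logged = log_file.read()
os.remove(log_path)
print(repr(logged))
```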