[Python Data Science Tutorial] Environment Setup, Numpy and Pandas Basics - Data Science with Python

JY的興趣行李箱

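Whether you want to sharpen your job skills or are simply curious about data, Python data analysis is easy to pick up. Colab templates for practice are provided at the end of this article.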

*What is Colab?
>> Colaboratory ("Colab" for short) lets you write and run Python in your browser, with the following advantages:

  • No setup required
  • Free access to GPUs
  • Easy sharing

Tutorial Outline

  1. Environment Setup
  2. Data Processing
  3. Exploratory / Statistical Data Analysis
  4. Feature Engineering – Feature Selection
  5. Machine Learning Model Training
    • Supervised Learning
      • Classification
        • KNN
      • Regression
    • Unsupervised Learning
      • Clustering
      • Association Rule Learning
  6. Deep Learning Model Training
    • Time Series Data
      • LSTM
      • GRU
    • Natural Language Processing (NLP)
    • Image Recognition
    • Other

1. Environment Setup

Before writing any Python code for data processing, we need to set up the environment:

Windows

Install Anaconda Python:

  1. Download the Anaconda installer
  2. Open “Anaconda Prompt”
  3. Enter conda list to confirm that the installation succeeded
  4. Enter python to check your current Python version (enter quit() to leave the Python shell)

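Once installation is complete, create a virtual environment first and then install the packages you need:

  1. Open “Anaconda Prompt”
  2. Create a virtual environment:
    Enter conda create --name env_name python=3.7 anaconda
    • anaconda: this trailing argument is the Anaconda metapackage; it makes the new environment include Anaconda's default packages
  3. Activate the virtual environment:
    conda activate env_name

When the prompt prefix changes to (env_name), the environment has been activated.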
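
The activation can also be double-checked from inside Python. The following is a minimal sketch, not part of the original tutorial: it assumes that conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated, and the helper name describe_environment is hypothetical.

```python
import os
import sys


def describe_environment():
    """Collect basic facts about the interpreter that is currently running."""
    return {
        # Version of the running interpreter, e.g. 3.7.x for python=3.7
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Installation folder of the running interpreter
        "prefix": sys.prefix,
        # Assumed to be set by `conda activate`; None if it is not set
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(key, "->", value)
```

If conda_env shows env_name and prefix points into that environment's folder, the packages you install next should land in the right place.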

2. Data Processing

Numpy Basics

  • Mathematical and logical operations on arrays
  • Linear algebra operations
  • Random number generation, and more

Numpy 1D Arrays

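 1. broadcast operations: math operations on an ndarray are broadcast to every element; a plain list does not support this

import numpy as np

example_list = [45, 69, 94, 40, 694, 596, 504]
example_array = np.array(example_list)
example_array * 2  # every element is doubled; example_list * 2 would only repeat the list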
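
To make the difference between a list and an ndarray concrete, here is a small sketch using the same numbers. The variable names repeated, doubled and shifted are chosen for illustration only.

```python
import numpy as np

example_list = [45, 69, 94, 40, 694, 596, 504]
example_array = np.array(example_list)

# Multiplying a list by an int repeats the list instead of scaling it
repeated = example_list * 2
print(len(repeated))  # 14

# Adding a number to a list is not defined at all
try:
    example_list + 10
except TypeError as err:
    print("list + 10 failed:", err)

# The same operations on an ndarray are applied to every element
doubled = example_array * 2
shifted = example_array + 10
print(doubled)
print(shifted)
```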

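 2. condition selection: an ndarray can be filtered with a logical condition; a list cannot

mask = example_array > 50  # comparing example_list with 50 would raise TypeError
example_array[mask]        # keeps only the elements greater than 50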
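
A slightly fuller sketch of condition selection, under the assumption that the goal is to keep the values above a threshold. The names mask, selected, in_range and from_list are illustrative.

```python
import numpy as np

example_array = np.array([45, 69, 94, 40, 694, 596, 504])

# Comparing an ndarray with a number yields one boolean per element
mask = example_array > 50
print(mask)

# Indexing with the mask keeps only the elements where it is True
selected = example_array[mask]
print(selected)  # [ 69  94 694 596 504]

# Conditions are combined with & and |, each wrapped in parentheses
in_range = example_array[(example_array > 50) & (example_array < 600)]
print(in_range)  # [ 69  94 596 504]

# The list counterpart needs an explicit loop or comprehension
from_list = [v for v in example_array.tolist() if v > 50]
```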

Numpy 2D Arrays

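 3. slicing: an ndarray can be sliced along both axes in a single expression; a nested list cannot

example_list = [[5,6,7,8,9],
                [7,8,9,10,11],
                [9,10,11,12,13]]
example_array = np.array(example_list)  # build a 2D array from the nested list
# example_list[:, 1:4] raises TypeError: list indices must be integers or slices, not tuple
example_array[:, 1:4]  # all rows, columns 1 to 3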
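
For comparison, the sketch below does the same column selection on the nested list and on the ndarray. It also illustrates one point the text does not cover: a basic NumPy slice is a view, so writing to it changes the original array. The variable names are illustrative.

```python
import numpy as np

example_list = [[5, 6, 7, 8, 9],
                [7, 8, 9, 10, 11],
                [9, 10, 11, 12, 13]]
example_array = np.array(example_list)

# Nested list: every row has to be sliced separately
columns_from_list = [row[1:4] for row in example_list]

# ndarray: one index expression, "all rows, columns 1 to 3"
columns_from_array = example_array[:, 1:4]
print(columns_from_array.shape)  # (3, 3)

# Basic slicing returns a view, so this write also changes example_array
columns_from_array[0, 0] = -1
print(example_array[0, 1])  # -1
```

Call .copy() on the slice when an independent array is needed.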

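 4. array operations: ndarrays can be multiplied element by element; lists cannot

multiplier = [34, 78, 90, 5, 9]
# example_list * multiplier raises TypeError: a list can only be repeated by an int
np.array(example_list) * np.array(multiplier)  # every row is multiplied element-wise by multiplier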
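
Note that * on ndarrays is element-wise, not a matrix product. The sketch below contrasts the two, using the @ operator for the matrix-vector product. The names elementwise and matrix_product are illustrative.

```python
import numpy as np

example_array = np.array([[5, 6, 7, 8, 9],
                          [7, 8, 9, 10, 11],
                          [9, 10, 11, 12, 13]])
multiplier = np.array([34, 78, 90, 5, 9])

# Shapes (3, 5) and (5,) are compatible: multiplier is reused for every row
elementwise = example_array * multiplier
print(elementwise.shape)  # (3, 5)

# The matrix-vector product equals the row sums of `elementwise`
matrix_product = example_array @ multiplier
print(matrix_product.shape)  # (3,)
```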

 5. basic statistics functions

  • Average: np.mean()
  • Median: np.median()
  • Standard deviation: np.std()
  • Pearson’s correlation: np.corrcoef()
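
A short sketch of the four functions on made-up data (x and y are illustrative values, not from the tutorial). Two details are easy to miss: np.std uses the population formula unless ddof=1 is passed, and np.corrcoef returns a full correlation matrix rather than a single number.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1  # perfectly linear in x

print(np.mean(x))    # 3.0
print(np.median(x))  # 3.0

# np.std defaults to the population formula (ddof=0)
population_std = np.std(x)
# ddof=1 gives the sample standard deviation
sample_std = np.std(x, ddof=1)

# np.corrcoef returns a 2x2 matrix here; entry [0, 1] is r between x and y
r = np.corrcoef(x, y)[0, 1]
print(population_std, sample_std, r)
```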

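In 1D Array:

def func(x, axis):
    """Print the mean, median, standard deviation and correlation of x."""
    print(np.mean(x, axis=axis))
    print(np.median(x, axis=axis))
    print(np.std(x, axis=axis))
    # np.corrcoef takes no axis argument; for 2D input each row is one variable
    print(np.corrcoef(x), '\n')

func(np.array([45, 69, 94, 40, 694, 596, 504]), axis=0)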

In 2D Array:

  • axis=0: vertical operation (down the rows, one result per column)
  • axis=1: horizontal operation (across the columns, one result per row)

example_array = np.array(example_list)
print('vertical: ')
func(example_array, axis=0)
print('horizontal: ')
func(example_array, axis=1)
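
The axis argument is easier to follow on a tiny array whose results can be checked by hand. The array small below is an illustrative example, not part of the tutorial data.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

column_means = np.mean(small, axis=0)  # collapses the rows
row_means = np.mean(small, axis=1)     # collapses the columns
overall_mean = np.mean(small)          # no axis: all elements

print(column_means)  # [2.5 3.5 4.5]
print(row_means)     # [2. 5.]
print(overall_mean)  # 3.5
```

A useful rule of thumb: the axis you name is the one that disappears from the result's shape.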

Pandas Basics

2 main data structures:

  • Series
  • DataFrame

Functionalities:

  • slicing, indexing and subsetting
  • groupby
  • reshape
  • pivot_table
  • merge, join, concat, etc.
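
As a rough preview of a few of these functionalities, here is a sketch on made-up data. The sales and stores tables, their column names and their values are all hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "product": ["tea", "coffee", "tea", "coffee"],
    "units": [10, 20, 30, 40],
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Taipei", "Tainan"]})

# groupby: total units per store
units_per_store = sales.groupby("store")["units"].sum()

# pivot_table: stores as rows, products as columns
pivot = sales.pivot_table(index="store", columns="product",
                          values="units", aggfunc="sum")

# merge: attach the city of each store
merged = sales.merge(stores, on="store", how="left")

# concat: stack two frames vertically and renumber the rows
stacked = pd.concat([sales, sales], ignore_index=True)

print(units_per_store)
print(pivot)
print(merged)
```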

Series

  1. create series
import numpy as np
import pandas as pd

# create from ndarray
data = np.random.randn(5)
pd.Series(data, index=['a','b','c','d','e'])
# create from dictionary
data = {
    'Facebook': 'Mark Zuckerberg',
    'Apple': 'Steve Jobs',
    'Amazon': 'Jeff Bezos',
    'Netflix': 'Reed Hastings',
    'Google': 'Larry Page'
}
pd.Series(data)

 2. demonstrate operations: array-like operations

data = np.random.randn(5)
ser = pd.Series(data)
ser[ser > ser.mean()]
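
A Series behaves like an ndarray in most operations, with one important difference: arithmetic between two Series aligns on index labels, not on positions. A minimal sketch with fixed values (chosen here so the result is reproducible):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# Boolean filtering works as with an ndarray; the mean is 2.5
above_mean = ser[ser > ser.mean()]

# NumPy functions return a Series that keeps the index
squared = np.square(ser)

# Arithmetic aligns on labels: "a" and "d" each exist on one side only
aligned = ser[1:] + ser[:-1]
print(aligned)
```

Labels that are missing on either side come out as NaN in the result.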

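 3. demonstrate operations: dict-like operations

ser = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])  # label index, so values can be looked up by key
ser
ser['a']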
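
A few more dict-like operations, sketched on a small Series with fixed values and a label index (both chosen here for illustration):

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(ser["a"])         # 10
print("b" in ser)       # True: membership tests the index labels
print(ser.get("z"))     # None: a missing label does not raise KeyError

ser["d"] = 40           # assigning to a new label appends it
print(list(ser.index))  # ['a', 'b', 'c', 'd']
```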

DataFrame

  1. create DataFrame
# create from dict of series
data = {
    'one': pd.Series(np.random.randn(5), index=['a','b','c','d','e']),
    'two': pd.Series(np.random.randn(5), index=['a','b','c','d','e']),
}
df = pd.DataFrame(data)
df
# create from dict of ndarrays/lists
data = {
    'one': np.random.randn(5),
    'two': np.random.randn(5)
}
df = pd.DataFrame(data)
df
# create from list of dicts
data = [{
    'a': 1,
    'b': 2
}, {
    'a': 3,
    'b': 4,
    'c': 6
}]
df = pd.DataFrame(data)
df
# create a MultiIndexed dataframe from tuples dict
data = {
    ('top-1','medium-1'): {('I','i'): 1, ('I','ii'): 2},
    ('top-1','medium-2'): {('I','ii'): 1, ('I','iii'): 2},
    ('top-2','medium-1'): {('I','i'): 1, ('I','ii'): 2},
    ('top-2','medium-2'): {('I','i'): 1, ('I','iii'): 2}
}
df = pd.DataFrame(data)
df
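
Two behaviors of these constructors are worth checking by hand; the sketch below reuses part of the data above. The names multi, sub and value are illustrative.

```python
import pandas as pd

# List of dicts: keys missing from a record become NaN
df = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 6}])
print(df["c"].isna().tolist())  # [True, False]

# Tuple keys become the levels of a MultiIndex
multi = pd.DataFrame({
    ("top-1", "medium-1"): {("I", "i"): 1, ("I", "ii"): 2},
    ("top-1", "medium-2"): {("I", "ii"): 1, ("I", "iii"): 2},
})
print(multi.columns.nlevels)  # 2

# Selecting an outer column label returns the sub-frame beneath it
sub = multi["top-1"]
print(list(sub.columns))  # ['medium-1', 'medium-2']

# A single cell is addressed with one tuple per axis
value = multi.loc[("I", "ii"), ("top-1", "medium-2")]
print(value)
```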

Colab Templates

1. Environment Setup
2-1. Numpy Basics
2-2. Pandas Basics

Licensed under CC BY-NC-ND 2.0