[Python Data Science Tutorial] Environment Setup, NumPy and Pandas Basics - Data Science with Python
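Whether you want to sharpen your job skills or are simply curious about data, getting started with Python data analysis is straightforward. A Colab template for practice is provided at the end of this article.
*What is Colab?
>> Colaboratory ("Colab" for short) lets you write and run Python in your browser, with the following advantages:
- No setup required
- Free access to GPUs
- Easy sharing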
Tutorial Outline
- Environment Setup
- Data Processing
- Exploratory / Statistical Data Analysis
- Feature Engineering – Feature Selection
- Machine Learning Model Training
  - Supervised Learning
    - Classification
      - KNN
    - Regression
  - Unsupervised Learning
    - Clustering
    - Association Rule Learning
- Deep Learning Model Training
  - Time Series Data
    - LSTM
    - GRU
  - Natural Language Processing (NLP)
  - Image Recognition
  - Other
1. Environment Setup
Before writing Python code to process data, we first need to set up the environment:
Windows
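Install Anaconda and Python:
- Download the Anaconda installer
- Open "Anaconda Prompt"
- Enter the following to confirm that the installation succeeded:
conda list
- Enter the following to check your current Python version (type quit() to leave the Python shell):
python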
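Once the installation is complete, create a virtual environment first, then install the packages you need.
- Open "Anaconda Prompt"
- Create the virtual environment:
conda create --name env_name python=3.7 anaconda
The trailing anaconda argument makes the new virtual environment include Anaconda's default packages automatically.
- Activate the virtual environment:
conda activate env_name
When the prompt prefix changes to (env_name), the environment has been activated.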
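Once the environment is active, it can help to confirm from inside Python that the interpreter and the two libraries used in this tutorial are available. The following is a minimal sketch, not part of the original setup steps; the helper name check_environment is hypothetical.

```python
import importlib
import sys


def check_environment(packages=("numpy", "pandas")):
    """Return a dict mapping each name to its version string, or None if missing."""
    # sys.version starts with the version number, e.g. "3.7.16 (default, ...)"
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            # The package is not installed in the active environment
            report[name] = None
    return report


report = check_environment()
for name, version in report.items():
    print(f"{name}: {version if version else 'not installed'}")
```

If a package is reported as not installed, it can usually be added with conda install inside the activated environment.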
2. Data Processing
NumPy Basics
- Mathematical and logical operations on arrays
- Linear algebra operations
- Random number generation, and more
NumPy 1D Arrays
1. broadcast operations: an ndarray supports broadcast arithmetic; a list does not
import numpy as np
example_list = [45, 69, 94, 40, 694, 596, 504]
example_array = np.array(example_list)
example_array + 10  # broadcast: adds 10 to every element
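2. condition selection: an ndarray supports boolean filtering; a list does not
mask = example_array > 50  # boolean mask; example_list > 50 raises TypeError
example_array[mask]  # keeps only the elements greater than 50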
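To make the contrast between list and ndarray concrete, here is a small sketch that reuses the values above. The variable names other than example_list and example_array are chosen for illustration only.

```python
import numpy as np

example_list = [45, 69, 94, 40, 694, 596, 504]
example_array = np.array(example_list)

# Broadcasting: the scalar 2 is applied to every element of the array
doubled_array = example_array * 2
# The same operator on a list repeats the list instead of doing arithmetic
repeated_list = example_list * 2
# A list needs an explicit loop (here a comprehension) to get the same result
doubled_list = [value * 2 for value in example_list]

# Condition selection: the comparison yields a boolean mask used as an index
mask = example_array > 50
large_values = example_array[mask]

# Comparing a list with a number is not supported at all
try:
    example_list > 50
    list_comparison_failed = False
except TypeError:
    list_comparison_failed = True

print(doubled_array.tolist())   # [90, 138, 188, 80, 1388, 1192, 1008]
print(len(repeated_list))       # 14
print(large_values.tolist())    # [69, 94, 694, 596, 504]
print(list_comparison_failed)   # True
```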
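NumPy 2D Arrays
3. slicing: an ndarray supports slicing along several dimensions at once; a nested list does not
example_list = [[5, 6, 7, 8, 9],
                [7, 8, 9, 10, 11],
                [9, 10, 11, 12, 13]]
example_array = np.array(example_list)
# example_list[:, 1:4] raises TypeError: a list cannot take a tuple index
example_array[:, 1:4]  # all rows, columns 1 to 3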
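4. array operations: ndarrays can be multiplied element-wise; lists cannot
multiplier = [34, 78, 90, 5, 9]
# example_list * multiplier raises TypeError: a list cannot be multiplied by a list
np.array(example_list) * np.array(multiplier)  # every row is multiplied element-wise by multiplier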
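The slicing and multiplication steps can be checked with a short sketch like the one below. It assumes the same 3x5 values as the example; the names middle_columns, middle_from_list and product are illustrative.

```python
import numpy as np

example_array = np.array([[5, 6, 7, 8, 9],
                          [7, 8, 9, 10, 11],
                          [9, 10, 11, 12, 13]])
multiplier = np.array([34, 78, 90, 5, 9])

# 2D slicing: every row, columns 1 up to (not including) 4
middle_columns = example_array[:, 1:4]
print(middle_columns.tolist())  # [[6, 7, 8], [8, 9, 10], [10, 11, 12]]

# A nested list needs a loop over the rows to do the same thing
middle_from_list = [row[1:4] for row in example_array.tolist()]

# Broadcasting: shape (3, 5) times shape (5,) multiplies each row by multiplier
product = example_array * multiplier
print(product.shape)        # (3, 5)
print(product[0].tolist())  # [170, 468, 630, 40, 81]
```

The multiplication works because the trailing dimension of both arrays is 5; a multiplier of any other length would raise a ValueError.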
5. basic statistics functions
- Average: np.mean()
- Median: np.median()
- Standard deviation: np.std()
- Pearson's correlation: np.corrcoef()
In 1D Array:
def func(x, axis):
    # use the argument x, not a global variable
    print(np.mean(x, axis=axis))
    print(np.median(x, axis=axis))
    print(np.std(x, axis=axis))
    print(np.corrcoef(x), '\n')
func(np.array([45, 69, 94, 40, 694, 596, 504]), axis=0)
In 2D Array:
- axis=0: operates vertically (down the rows, giving one result per column)
- axis=1: operates horizontally (across the columns, giving one result per row)
example_array = np.array(example_list)
print('vertical: ')
func(example_array, axis=0)
print('horizontal: ')
func(example_array, axis=1)
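The effect of the axis argument is easiest to see from the shape of the result. The sketch below assumes the same 3x5 values as the 2D example; the name grid is illustrative.

```python
import numpy as np

grid = np.array([[5, 6, 7, 8, 9],
                 [7, 8, 9, 10, 11],
                 [9, 10, 11, 12, 13]])

# axis=0 collapses the rows: one mean per column (5 values)
column_means = np.mean(grid, axis=0)
# axis=1 collapses the columns: one mean per row (3 values)
row_means = np.mean(grid, axis=1)

print(column_means.tolist())  # [7.0, 8.0, 9.0, 10.0, 11.0]
print(row_means.tolist())     # [7.0, 9.0, 11.0]

# np.median and np.std accept the same axis argument
column_medians = np.median(grid, axis=0)
row_stds = np.std(grid, axis=1)

# np.corrcoef treats each row as one variable by default,
# so a 3x5 input gives a 3x3 correlation matrix
correlations = np.corrcoef(grid)
print(correlations.shape)     # (3, 3)
```

Because each row here is the previous row shifted by a constant, all pairwise correlations are 1 up to floating-point rounding.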
Pandas Basics
2 main data structures:
- Series
- DataFrame
Functionalities:
- slicing, indexing and subsetting
- groupby
- reshape
- pivot_table
- merge, join, concat, etc.
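As a rough preview of the functionalities listed above, the sketch below runs each of them once on a tiny table. The sales and managers data are made up for illustration and do not come from the tutorial's notebook.

```python
import pandas as pd

# Hypothetical sample data
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "item": ["pen", "ink", "pen", "ink"],
    "units": [3, 5, 2, 4],
})
managers = pd.DataFrame({"store": ["A", "B"], "manager": ["Ann", "Bob"]})

# subsetting: keep the rows that satisfy a condition
pens = sales[sales["item"] == "pen"]

# groupby: total units per store
units_per_store = sales.groupby("store")["units"].sum()

# pivot_table (a reshape): stores as rows, items as columns
table = sales.pivot_table(index="store", columns="item",
                          values="units", aggfunc="sum")

# merge: attach the manager of each store
merged = sales.merge(managers, on="store", how="left")

# concat: stack two tables vertically
stacked = pd.concat([sales, sales], ignore_index=True)

print(units_per_store.to_dict())  # {'A': 8, 'B': 6}
print(table.loc["A", "pen"])      # 3
print(merged.shape)               # (4, 4)
print(len(stacked))               # 8
```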
Series
1. create series
import numpy as np
import pandas as pd
# create from ndarray
data = np.random.randn(5)
pd.Series(data, index=['a','b','c','d','e'])
# create from dictionary
data = {
    'Facebook': 'Mark Zuckerberg',
    'Apple': 'Steve Jobs',
    'Amazon': 'Jeff Bezos',
    'Netflix': 'Reed Hastings',
    'Google': 'Larry Page'
}
pd.Series(data)
2. demonstrate operations: array-like operations
data = np.random.randn(5)
ser = pd.Series(data)
ser[ser > ser.mean()]
3. demonstrate operations: dict-like operations
ser = pd.Series(data, index=['a','b','c','d','e'])  # label index, so lookup by key works
ser
ser['a']
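Because the examples above use random numbers, their output changes on every run. The sketch below repeats both kinds of operation with fixed values, chosen here only for illustration, so the results can be checked by hand.

```python
import pandas as pd

ser = pd.Series([1.0, 4.0, 2.0, 5.0, 3.0], index=["a", "b", "c", "d", "e"])

# array-like: boolean filtering against the mean (3.0 for these values)
above_mean = ser[ser > ser.mean()]
print(above_mean.to_dict())  # {'b': 4.0, 'd': 5.0}

# dict-like: lookup by label, membership test, and lookup with a default
print(ser["a"])            # 1.0
print("f" in ser)          # False
print(ser.get("f", 0.0))   # 0.0
```

Label lookup such as ser["a"] only works when the Series has those labels in its index; a Series built without an index argument gets integer labels 0, 1, 2, and so on.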
DataFrame
- create DataFrame
# create from dict of series
data = {
    'one': pd.Series(np.random.randn(5), index=['a','b','c','d','e']),
    'two': pd.Series(np.random.randn(5), index=['a','b','c','d','e']),
}
df = pd.DataFrame(data)
df
# create from dict of ndarrays/lists
data = {
    'one': np.random.randn(5),
    'two': np.random.randn(5)
}
df = pd.DataFrame(data)
df
# create from list of dicts
data = [{
    'a': 1,
    'b': 2
}, {
    'a': 3,
    'b': 4,
    'c': 6
}]
df = pd.DataFrame(data)
df
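In the list-of-dicts case the first record has no 'c' key, so pandas fills that cell with NaN. A minimal sketch of what that looks like and one possible way to handle it follows; the names records and c_filled are illustrative.

```python
import pandas as pd

records = [{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 6}]
df = pd.DataFrame(records)

# Columns are the union of all keys seen in the records
print(list(df.columns))          # ['a', 'b', 'c']
# The missing value in the first row becomes NaN
print(df["c"].isna().tolist())   # [True, False]
# NaN is a float, so the whole column is stored as float64
print(df["c"].dtype)             # float64

# One option: replace NaN with 0 and convert back to integers
c_filled = df["c"].fillna(0).astype(int)
print(c_filled.tolist())         # [0, 6]
```

Whether filling with 0 is appropriate depends on what the missing value means in your data.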
# create a MultiIndexed dataframe from tuples dict
data = {
    ('top-1','medium-1'): {('I','i'): 1, ('I','ii'): 2},
    ('top-1','medium-2'): {('I','ii'): 1, ('I','iii'): 2},
    ('top-2','medium-1'): {('I','i'): 1, ('I','ii'): 2},
    ('top-2','medium-2'): {('I','i'): 1, ('I','iii'): 2}
}
df = pd.DataFrame(data)
df
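With tuple keys, both the columns and the rows become two-level indexes. The sketch below rebuilds the same DataFrame and shows a few ways to select from it; the names top1, value and missing_cells are illustrative.

```python
import pandas as pd

data = {
    ("top-1", "medium-1"): {("I", "i"): 1, ("I", "ii"): 2},
    ("top-1", "medium-2"): {("I", "ii"): 1, ("I", "iii"): 2},
    ("top-2", "medium-1"): {("I", "i"): 1, ("I", "ii"): 2},
    ("top-2", "medium-2"): {("I", "i"): 1, ("I", "iii"): 2},
}
df = pd.DataFrame(data)

# Both axes have two index levels
print(df.columns.nlevels, df.index.nlevels)  # 2 2

# Selecting a top-level column label returns the columns nested under it
top1 = df["top-1"]
print(list(top1.columns))  # ['medium-1', 'medium-2']

# A single cell is addressed by a (row tuple, column tuple) pair
value = df.loc[("I", "ii"), ("top-1", "medium-2")]
print(value)  # 1.0

# Row/column combinations absent from the dict are filled with NaN
missing_cells = int(df.isna().sum().sum())
print(missing_cells)  # 4
```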
Colab Template
1. Environment Setup
2-1. Numpy Basics
2-2. Pandas Basics