使用 Python 列出 DOCX 文檔中所有粗體字

Makzan

2021 年 11 月 17 日

（修改过）

IPFS

歡迎來到第一個星期的 #編程星期三，今個星期我示範一下如何使用 Python 讀取一個 DOCX 文件內的所有粗體字。

首先，若果未安裝 python-docx，可以通過 pip install python-docx 安裝處理 docx 的工具。

安裝後，我們可以通過 import docx 來匯入使用。

打開文件

使用 doc = Document("Sample Document.docx") 可以打開現有的一個檔案。例如，我準備了一個包括不同粗體字的 Word DOCX 檔案。

列出所有段落（包括標題）

打開檔案後，首先通常是列出一些內容來確認我們成功打開了正確的檔案。例如先列出所有段落。

for p in doc.paragraphs:
    print(p)

運行結果是一堆段落的物件：

對於每個段落，我們可以做甚麼？

首先，我通常會先看一看段落的文字。

for p in doc.paragraphs:
    print(p.text)

在每個段落中找出不同樣式段

下一步，我們需要在每個段落中，找出粗體字的部份，在每個段落中，不同的樣式會以不同的 run 來儲存。所以如果整段文字都是一種樣式或預設樣式，則那段文字只會有一個 run，但如果當中有不同樣式混合的，例如當中有粗體字等。所以通過對段落中的所有 runs 進行檢查，就可以找出 run.bold == True 的那些為粗體字。

for p in doc.paragraphs:
     for run in p.runs:
        print(run.bold, run.text)

放到一起

將我們所嘗握的放到一起，就能得出以下代碼及運行結果。

import docx
print("Printing all bold text from Sample Document.docx:")

doc = docx.Document("Sample Document.docx")

for p in doc.paragraphs:    
    for run in p.runs:
        if run.bold:
            print(run.text)

這個例子只列出一個特定的檔案，如果配合 glob，我們可以將上述代碼擴展至支援現有資料夾內的所有 DOCX 文件。

import docx
import glob

files = glob.glob("*.docx")

for file in files:
    print(f"Printing all bold text from {file}:")

    doc = docx.Document(file)

    for p in doc.paragraphs:    
        for run in p.runs:
            if run.bold:
                print(run.text)

列出所有表格（Table）

剛剛的例子還有個粗體其實是遺漏了的。就是表格中所包含的粗體字。這些表格是不包括在 paragraphs 物件列表的，所以我們需要進一步從表格中取得這些文字樣式。而表格中的欄與列分別以 cols 和 rows 來存取，當中每一格都會有各自的 paragraphs 列表。

所以，針對這些表格，我們可以擴展代碼並加入表格的循環：

import docx
import glob

files = glob.glob("*.docx")

for file in files:
    print(f"Printing all bold text from {file}:")

    doc = docx.Document(file)

    for p in doc.paragraphs:    
        for run in p.runs:
            if run.bold:
                print(run.text)

    
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for p in cell.paragraphs:
                    for run in p.runs:
                        if run.bold:
                            print(run.text)

但上述這段代碼入太多層了，對將來維護會有難度。這樣寫的原因是因為文件中的表格內的每格儲存格（Cell）都有其自己的段落列表。

其中一個優化方案，可以將從段落尋找粗體字的部份抽取出來造成函數。

def print_bold_text_in_paragraphs(paragraphs):
    for p in paragraphs:    
        for run in p.runs:
            if run.bold:
                print(run.text)

然後我們便只需要在每格調用函數

import docx
import glob

files = glob.glob("*.docx")

for file in files:
    print(f"Printing all bold text from {file}:")

    doc = docx.Document(file)

    print_bold_text_in_paragraphs(doc.paragraphs)

    # Iterate all cells in all tables
    for table in doc.tables:        
        for row in table.rows:
            for cell in row.cells:
                print_bold_text_in_paragraphs(cell.paragraphs)