Using Machine Learning to Solve Problems: Exploring Book Styles

緯緯道來

Foreword & Overview

This is the ninth article in a series introducing machine learning concepts. In the previous article, we explained how to solve the "house price forecasting" problem through the five steps of machine learning (defining the problem, building a dataset, model training, model evaluation, and model inference).

House price prediction is a supervised learning problem. In this article, I will use "exploring book styles" as an example of unsupervised learning.

Step 1 : Define the problem

Suppose you are the manager of a large bookstore and you want to explore the styles of all the books sold in the store over the past year. Each book already has a category, such as English certification exams, programming languages, or natural science. However, you want to know more than these broad categories; you want to grasp finer-grained detail.

If the bookstore did poorly last year and sold only 10 books, you could simply read all ten to get a general idea of what customers like. But what if you sold more than 1,000 books? At that point, we can use machine learning to help us find the latent styles of these 1,000 books.

We assume that each book comes with an abstract that gives a general idea of its content. We feed the abstracts of these 1,000 books into the model, and the model explores the styles based on these abstracts and groups books with similar styles together.

In the above process, we only input the abstract of each book into the model and do not predetermine the style of any abstract; the styles are discovered by the model itself. Because we do not give the model a correct answer (label), this is "unsupervised learning".

In addition, the model finds latent relationships between books based on their abstracts and assigns closely related books to the same group, so this is a "clustering" task.

Step 2 : Create a dataset

The process of building a dataset can be divided into the following stages:

  • Data Collection
    There are many ways to collect data. We can hire part-time student workers to type each book's abstract into a computer, or collect the abstracts from the Internet with a web crawler.
  • Data Exploration and Data Cleaning
    During this phase, we get to know the data we have collected and process it into a form suitable as model input. For example, the data we prepare is an "abstract" consisting of many sentences. Sentences may contain many unimportant elements that we can remove first. For English sentences, we can:
    - Remove punctuation (, . ! ?)
    - Delete unimportant words (a, an, the)
    - Convert uppercase to lowercase (White => white)
    - Unify verb tenses into the present tense (did => does)
    After cleaning, the data must be converted from "string type" to "numeric type" before being input into the model. This process is called data vectorization.
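
The cleaning and vectorization steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the word lists `STOP_WORDS` and `PRESENT_TENSE`, the function names, and the two sample abstracts are all made up for this example, and the vectorization shown is a simple word count (bag of words), which is one of several possible choices.

```python
import string
from collections import Counter

# Hypothetical word lists for illustration; real ones are far longer.
STOP_WORDS = {"a", "an", "the"}
PRESENT_TENSE = {"did": "does"}


def clean(abstract):
    """Lowercase, strip punctuation, drop stop words, unify verb tense."""
    lowered = abstract.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punct.split() if w not in STOP_WORDS]
    return [PRESENT_TENSE.get(w, w) for w in words]


def vectorize(token_lists):
    """Turn token lists into word-count vectors over a shared vocabulary."""
    vocabulary = sorted({w for tokens in token_lists for w in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors


# Two made-up abstracts, just to show the data flowing through both steps.
abstracts = ["The White rabbit did run!", "A rabbit does run."]
tokens = [clean(a) for a in abstracts]
vocabulary, vectors = vectorize(tokens)

print(tokens)      # cleaned words for each abstract
print(vocabulary)  # every distinct word, in sorted order
print(vectors)     # one row of word counts per abstract
```

Each abstract becomes one row of numbers with one column per vocabulary word, which is the "numeric type" a clustering model can work with.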

Step 3 : Model training

Once the dataset is built, we move on to building and training the model. When defining the problem, we established that "exploring book styles" is a clustering problem, and one of the most common models for clustering is K-Means. In this article, we will not go into the principles of K-Means in depth.

As shown in the figure above, each dot represents the abstract of one book. With the K-Means model, we can divide these dots (books) into multiple groups. When we specify k=2, all the points are divided into two groups (as shown on the left of the figure); when k=3, all the points are divided into three groups (as shown on the right of the figure).
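
For readers who want to see the idea in code, here is a rough pure-Python sketch of K-Means. It is not the implementation used for the figure; the function name `kmeans`, its parameters, and the six two-dimensional points are assumptions made for illustration. In practice one would more likely use a library implementation such as scikit-learn's `KMeans`.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-Means sketch: returns (centroids, labels) for numeric vectors."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Made-up 2D points standing in for vectorized abstracts:
# three near (1, 1) and three near (8.5, 9).
points = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9.5], [8.5, 9]]
centroids, labels = kmeans(points, k=2)

# The first three points share one label and the last three share the other.
print(labels)
print(centroids)
```

The only thing we choose by hand is k; which points end up together is decided entirely by the distances between them.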

Step 4 : Model Evaluation

In the model evaluation stage, we can assess the quality of the model with a variety of statistical metrics. As shown in the figure above, we want to find the most suitable number of clusters (k), so that similar books are placed in the same group. We can use the silhouette coefficient to help us find the most suitable k.

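With the silhouette coefficient, we can observe the quality of the model under different values of k. The coefficient ranges from -1 to 1, and a higher value means the clusters are better separated. As shown in the figure below, the optimal k value is 19.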
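
To make the metric more concrete, the sketch below computes the mean silhouette coefficient by hand and uses it to compare two candidate groupings. The function name `silhouette_score`, the six sample points, and the two hand-written label lists are assumptions for illustration only; in a real project the labels would come from a trained clustering model, and a library function such as scikit-learn's `silhouette_score` would usually be used.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up 2D points: three near (1, 1) and three near (8.5, 9).
points = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9.5], [8.5, 9]]

# Hand-written candidate groupings for k=2 and k=3 (illustrative only).
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}
scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)

print(scores)
print(best_k)  # k=2 scores higher on this data
```

Splitting the tight group near (1, 1) into two clusters lowers the score, which is why k=2 wins here. The same comparison, repeated over a range of k values, is how an optimal k is picked.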

Once we have found the most suitable k, we can look at the largest cluster, that is, the cluster that contains the most samples (books). By reading the book abstracts in this cluster, you can get a deeper understanding of the most popular book style.
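
Finding the largest cluster is a simple counting exercise. A small sketch, assuming the model has already given us one cluster label per book; the labels and abstracts below are invented for the example:

```python
from collections import Counter

# Hypothetical output of a clustering model: one cluster label per book.
labels = [2, 0, 2, 1, 2, 0, 2]
abstracts = [
    "abstract A", "abstract B", "abstract C", "abstract D",
    "abstract E", "abstract F", "abstract G",
]

# most_common(1) returns the label with the highest count and that count.
largest_cluster, size = Counter(labels).most_common(1)[0]

# Collect the abstracts that belong to the largest cluster for reading.
popular = [a for a, label in zip(abstracts, labels) if label == largest_cluster]

print(largest_cluster, size)
print(popular)
```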

Step 5 : Model Inference

During the model inference phase, we can start using the model. We can feed a new book abstract into the model and observe which cluster it is assigned to, or examine the abstracts in other clusters to understand what styles the books in the same cluster share and why the machine learning model grouped them together.
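
With K-Means, assigning a new abstract to a cluster amounts to finding the nearest cluster center. The sketch below assumes the new abstract has already been cleaned and vectorized in the same way as the training data; the function name `predict`, the centroid values, and the new vectors are hypothetical.

```python
import math


def predict(vector, centroids):
    """Return the index of the centroid closest to the given vector."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))


# Hypothetical cluster centers, as a trained K-Means model might produce.
centroids = [[1.0, 1.0], [8.5, 9.0]]

# Hypothetical vectors for two new book abstracts.
new_book = [7.5, 8.0]
another_book = [0.5, 1.5]

print(predict(new_book, centroids))      # closest to the second center
print(predict(another_book, centroids))  # closest to the first center
```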

Epilogue

In this article, we worked through the unsupervised learning example of "exploring book styles" using the five steps of machine learning, and learned how the K-Means model and the silhouette coefficient are applied. In the next article, we will introduce a more powerful model (the neural network) to solve more difficult problems.

CC BY-NC-ND 2.0


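緯緯道來: Graduate student majoring in computer science and information engineering, passionate about deep learning and machine learning. I am starting out with mostly basic programming tutorials, and I hope my articles can help you! (https://linktr.ee/johnnyhwu)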