Huawei AI Certification Training

HCIA-AI

Machine Learning

Experiment Guide

 

 

ISSUE: 3.0

 

 

 

 

 

 

HUAWEI TECHNOLOGIES CO., LTD.

 

Copyright © Huawei Technologies Co., Ltd. 2020. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

 

Trademarks and Permissions

Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.

All other trademarks and trade names mentioned in this document are the property of their respective holders.

 

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

 

 

 

 

 

Huawei Technologies Co., Ltd.

Address:

Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129

People's Republic of China

Website:

http://e.huawei.com

 

 

 

Huawei Certificate System

Huawei Certification follows the "platform + ecosystem" development strategy, which is a new collaborative architecture of ICT infrastructure based on "Cloud-Pipe-Terminal". Huawei has set up a complete certification system consisting of three categories: ICT infrastructure certification, platform and service certification, and ICT vertical certification, making Huawei certification the only all-range technical certification in the industry.

Huawei offers three levels of certification: Huawei Certified ICT Associate (HCIA), Huawei Certified ICT Professional (HCIP), and Huawei Certified ICT Expert (HCIE).

HCIA-AI V3.0 aims to train and certify engineers who are capable of designing and developing AI products and solutions using algorithms such as machine learning and deep learning.

HCIA-AI V3.0 certification demonstrates that you know the development history of AI, the Huawei Ascend AI system, and Huawei's full-stack, all-scenario AI strategy; that you have mastered traditional machine learning and deep learning algorithms; that you can use the TensorFlow and MindSpore development frameworks to build, train, and deploy neural networks; and that you are competent for sales, marketing, product manager, project management, and technical support positions in the AI field.

 


 


 

About This Document

Overview

This document is intended for candidates who are preparing for the HCIA-AI exam and for readers who want to understand the basics of AI programming. After working through this guide, you will be able to perform basic machine learning programming.

Description

This guide contains two experiments, which use basic Python code to reproduce, step by step, how a linear regression model is trained and how a CART decision tree is built. It is hoped that trainees and readers can get started with machine learning and acquire basic machine learning programming capabilities.

Background Knowledge Required

To fully understand this course, readers should have basic Python programming skills and knowledge of data structures and machine learning algorithms.

Experiment Environment Overview

Python Development Tool

This experiment environment is developed and compiled based on Python 3.6. The NumPy, Matplotlib, and Pandas modules are used.


 

Contents

About This Document

Overview

Description

Background Knowledge Required

Experiment Environment Overview

1 Linear Regression Details

1.1 Introduction

1.1.1 About This Experiment

1.1.2 Objectives

1.2 Experiment Code

1.2.1 Data Preparation

1.2.2 Define Related Functions

1.2.3 Start the Iteration

1.3 Thinking and Practice

1.3.1 Question 1

1.3.2 Question 2

2 Decision Tree Details

2.1 Introduction

2.1.1 About This Experiment

2.1.2 Objectives

2.2 Experiment Code

2.2.1 Import the Required Modules

2.2.2 Hyperparameter Definition Section

2.2.3 Define the Functions Required to Complete the Algorithm

2.2.4 Execute the Code

 

 

 

 

 

 

 

 

 

1 Linear Regression Details

1.1 Introduction

1.1.1 About This Experiment

This experiment uses basic Python code and the simplest possible data to reproduce how a linear regression algorithm iterates and fits the existing data distribution step by step.

The experiment mainly uses the NumPy and Matplotlib modules: NumPy for calculation and Matplotlib for plotting.

1.1.2 Objectives

The main objectives of this experiment are as follows:

Understand how a linear regression model, its loss function, and gradient descent optimization work together.

Observe how the fitted line and the loss value change as the number of iterations increases.

1.2 Experiment Code

1.2.1 Data Preparation

Ten data points are defined manually, roughly following a linear relationship.

The data is converted to NumPy array format so that multiplication and addition can be applied to it directly.

Code:

#Import the required modules: NumPy for calculation and Matplotlib for drawing
import numpy as np
import matplotlib.pyplot as plt
#This line is for Jupyter Notebook only
%matplotlib inline

# Define the data and change the lists to arrays
x = [3,21,22,34,54,34,55,67,89,99]
x = np.array(x)
y = [1,10,14,34,44,36,22,67,79,90]
y = np.array(y)

#Show the data as a scatter plot
plt.scatter(x,y)

 

Output:

Figure 1-1 Scatter plot

1.2.2 Define Related Functions

Model function: defines the linear regression model wx + b.

Loss function: the mean squared error (MSE) loss.

Optimization function: gradient descent, which uses the partial derivatives of the loss with respect to w and b.
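In symbols, the three functions below implement the following (writing the slope as $a$, the intercept as $b$, $m$ for the number of samples, and $\eta$ for the learning rate Lr):

$$L(a,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(ax_i+b-y_i\right)^2$$

$$\frac{\partial L}{\partial a}=\frac{1}{m}\sum_{i=1}^{m}\left(ax_i+b-y_i\right)x_i,\qquad \frac{\partial L}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}\left(ax_i+b-y_i\right)$$

$$a\leftarrow a-\eta\frac{\partial L}{\partial a},\qquad b\leftarrow b-\eta\frac{\partial L}{\partial b}$$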

 

Code:

#The basic linear regression model is wx + b; since the data here is two-dimensional, the model is written as ax + b
def model(a, b, x):
    return a*x + b

#The most commonly used loss function for linear regression is the mean squared error (MSE) loss
def loss_function(a, b, x, y):
    num = len(x)
    prediction = model(a, b, x)
    return (0.5/num) * (np.square(prediction-y)).sum()

#The optimization function uses the partial derivatives to update the two parameters a and b
def optimize(a, b, x, y):
    num = len(x)
    prediction = model(a, b, x)
    #Update the values of a and b using the partial derivatives of the loss function with respect to a and b
    #(Lr is the learning rate, a global variable defined in 1.2.3)
    da = (1.0/num) * ((prediction - y)*x).sum()
    db = (1.0/num) * ((prediction - y).sum())
    a = a - Lr*da
    b = b - Lr*db
    return a, b

#Iteration function: repeatedly run the optimization and return a and b
def iterate(a, b, x, y, times):
    for i in range(times):
        a, b = optimize(a, b, x, y)
    return a, b
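As a quick sanity check, the following minimal sketch (it assumes x and y from 1.2.1 are defined and, for illustration, starts from a = b = 0 instead of random values) runs a single optimization step:

Code:

# One manual gradient step from a = b = 0
Lr = 1e-4
a0 = np.zeros(1)
b0 = np.zeros(1)
a1, b1 = optimize(a0, b0, x, y)
print(a1, b1)  # a jumps toward the data's slope; b moves little because its gradient is much smaller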

1.2.3 Start the Iteration

Step 1 Initialize the parameters and run the first iteration; display the parameter values, loss, and fitted line after the iteration.

Code:

#Initialize the parameters and display them
a = np.random.rand(1)
print(a)
b = np.random.rand(1)
print(b)
Lr = 1e-4

#For the first iteration, display the parameter values, loss, and visualization after the iteration
a,b = iterate(a,b,x,y,1)
prediction = model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 1-2 Iterate 1 time

Step 2 In the second round of iteration, display the parameter values, loss, and visualization after the iteration.

Code:

a,b = iterate(a,b,x,y,2)
prediction = model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 1-3 Iterate 2 times

Step 3 In the third round of iteration, display the parameter values, loss, and visualization after the iteration.

Code:

a,b = iterate(a,b,x,y,3)
prediction = model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 1-4 Iterate 3 times

Step 4 In the fourth round of iteration, display the parameter values, loss, and visualization after the iteration.

Code:

a,b = iterate(a,b,x,y,4)
prediction = model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 1-5 Iterate 4 times

Step 5 In the fifth round of iteration, display the parameter values, loss, and visualization after the iteration.

Code:

a,b = iterate(a,b,x,y,5)
prediction = model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 1-6 Iterate 5 times

Step 6 Run 10,000 iterations and display the parameter values, loss, and visualization after the iteration.

Code:

a,b = iterate(a,b,x,y,10000)
prediction = model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)

Output:

Figure 1-7 Iterate 10000 times

1.3 Thinking and Practice

1.3.1 Question 1

Try modifying the original data. Think about it: does the loss value always have to converge to zero?
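As a hint, the following minimal sketch (hypothetical toy data, reusing the functions above; a larger learning rate is used here because this x is small) compares data lying exactly on a line with noisy data:

Code:

# Case 1: points exactly on a line -- the loss can be driven toward 0
Lr = 1e-2
x2 = np.array([1., 2., 3., 4., 5.])
y2 = 3*x2 + 2
a2, b2 = np.random.rand(1), np.random.rand(1)
a2, b2 = iterate(a2, b2, x2, y2, 100000)
print(loss_function(a2, b2, x2, y2))        # approximately 0

# Case 2: noisy points -- the loss settles at a positive minimum
y2_noisy = y2 + np.random.randn(5)
a2, b2 = np.random.rand(1), np.random.rand(1)
a2, b2 = iterate(a2, b2, x2, y2_noisy, 100000)
print(loss_function(a2, b2, x2, y2_noisy))  # stays above 0

The loss only reaches zero when the data is perfectly linear; otherwise it converges to the minimum mean squared error of the best-fitting line.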

1.3.2 Question 2

Modify the value of Lr. Think about it: what is the role of the Lr (learning rate) parameter?
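A minimal sketch for exploring Lr (it reuses x, y, and the functions defined above; note that optimize() reads the global Lr):

Code:

# Compare several learning rates on the same data
for lr in (1e-6, 1e-4, 1e-3):
    Lr = lr
    a, b = np.random.rand(1), np.random.rand(1)
    a, b = iterate(a, b, x, y, 100)
    print(lr, loss_function(a, b, x, y))
Lr = 1e-4  # restore the original value

Lr is the learning rate: it scales every gradient descent step. If it is too small, convergence is very slow; if it is too large (for this data, roughly above 6e-4), the updates overshoot and the loss blows up.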

 

 

 

 

 

 

 

 


 

 

2 Decision Tree Details

2.1 Introduction

2.1.1 About This Experiment

This experiment walks through the decision tree algorithm using basic Python code.

It mainly uses the NumPy, Pandas, and Math modules. We will implement the CART (Classification and Regression Tree) models in this experiment.

You have to download the dataset for this experiment in advance from the following link:

https://certification-data.obs.cn-north-4.myhuaweicloud.com/ENG/HCIA-AI/V3.0/ML-Dataset.rar

2.1.2 Objectives

The purpose of this experiment is as follows:

Understand the splitting criteria of CART: the Gini index for classification and standard deviation reduction for regression.

Build, run, and interpret a CART decision tree using basic Python code.

2.2 Experiment Code

2.2.1 Import the Required Modules

Pandas is a tabular data processing module.

Math is mainly used for mathematical calculations.

NumPy is the basic computing module.

Code:

import pandas as pd
import math
import numpy as np

2.2.2 Hyperparameter Definition Section

Here you can choose between the classification tree and the regression tree, specify the path of the dataset, record the name and type of each feature, and check whether the selected algorithm matches the dataset.

Code:

algorithm = "Regression"  #  Algorithm: Classification, Regression

algorithm = "Classification"  #  Algorithm: Classification, Regression

 

# Dataset1: Text features and text labels

#df = pd.read_csv("D:/Code/Decision Treee/candidate/decision-trees-for-ml-master/decision-trees-for-ml-master/dataset/golf.txt")

 

# Dataset2: Mix features and Numeric labels, here you have to change the path to yours.

df = pd.read_csv("ML-Dataset/golf4.txt")

 

# This dictionary is used to store feature types of continuous numeric features and discrete literal features for subsequent judgment

dataset_features = dict()

 

num_of_columns = df.shape[1]-1

#The data type of each column of the data is saved for displaying the data name

for i in range(0, num_of_columns):

    #Gets the column name and holds the characteristics of a column of data by column

    column_name = df.columns[i]

    #Save the type of the data

    dataset_features[column_name] = df[column_name].dtypes

# The size of the indent when display

root = 1

 

# If the algorithm selects a regression tree but the label is not a continuous value, an error is reported

if algorithm == 'Regression':

    if df['Decision'].dtypes == 'object':

        raise ValueError('dataset wrong')

# If the tag value is continuous, the regression tree must be used

if df['Decision'].dtypes != 'object':

    algorithm = 'Regression'

    global_stdev = df['Decision'].std(ddof=0)

2.2.3 Define the Functions Required to Complete the Algorithm

1. processContinuousFeatures: converts a continuous numeric feature into a categorical feature.

Code:

# This function is used to handle continuous numeric features
# (the entropy argument is received but not used here)
def processContinuousFeatures(cdf, column_name, entropy):
    # Sort the unique values of the numeric feature
    unique_values = sorted(cdf[column_name].unique())

    subset_ginis = []
    subset_red_stdevs = []

    for i in range(0, len(unique_values) - 1):
        threshold = unique_values[i]
        # Split the data using the current value as the threshold
        subset1 = cdf[cdf[column_name] <= threshold]
        subset2 = cdf[cdf[column_name] > threshold]
        # Calculate the proportion occupied by each of the two parts
        subset1_rows = subset1.shape[0]
        subset2_rows = subset2.shape[0]
        total_instances = cdf.shape[0]
        # For classification, compute the weighted Gini index of the two subsets after segmentation;
        # for regression, compute how much the split reduces the standard deviation
        if algorithm == 'Classification':
            decision_for_subset1 = subset1['Decision'].value_counts().tolist()
            decision_for_subset2 = subset2['Decision'].value_counts().tolist()

            gini_subset1 = 1
            gini_subset2 = 1

            for j in range(0, len(decision_for_subset1)):
                gini_subset1 = gini_subset1 - math.pow((decision_for_subset1[j] / subset1_rows), 2)

            for j in range(0, len(decision_for_subset2)):
                gini_subset2 = gini_subset2 - math.pow((decision_for_subset2[j] / subset2_rows), 2)

            gini = (subset1_rows / total_instances) * gini_subset1 + (subset2_rows / total_instances) * gini_subset2

            subset_ginis.append(gini)

        # Take the standard deviation as the criterion and calculate its decrease for this split
        elif algorithm == 'Regression':
            superset_stdev = cdf['Decision'].std(ddof=0)
            subset1_stdev = subset1['Decision'].std(ddof=0)
            subset2_stdev = subset2['Decision'].std(ddof=0)

            threshold_weighted_stdev = (subset1_rows / total_instances) * subset1_stdev + (
                        subset2_rows / total_instances) * subset2_stdev
            threshold_reducted_stdev = superset_stdev - threshold_weighted_stdev
            subset_red_stdevs.append(threshold_reducted_stdev)

    # Find the index of the best split value
    if algorithm == "Classification":
        winner_one = subset_ginis.index(min(subset_ginis))
    elif algorithm == "Regression":
        winner_one = subset_red_stdevs.index(max(subset_red_stdevs))
    # Find the corresponding threshold according to the index
    winner_threshold = unique_values[winner_one]

    # Convert the original numeric column into a string column:
    # values are replaced with "<=threshold" or ">threshold"
    cdf[column_name] = np.where(cdf[column_name] <= winner_threshold, "<=" + str(winner_threshold), ">" + str(winner_threshold))

    return cdf
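A minimal usage sketch (hypothetical toy data; it assumes algorithm = "Classification" as set in 2.2.2). The best threshold is 70, because splitting there produces two pure subsets:

Code:

toy = pd.DataFrame({'Humidity': [65, 70, 80, 90],
                    'Decision': ['Yes', 'Yes', 'No', 'No']})
toy = processContinuousFeatures(toy, 'Humidity', 0)
print(toy['Humidity'].tolist())  # ['<=70', '<=70', '>70', '>70']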

2. calculateEntropy: calculates the entropy of the Decision column for the classification tree; for the regression tree it simply returns 0.
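For reference, the splitting criteria used in this experiment are, for a dataset $D$ with class proportions $p_i$ and subsets $D_k$ produced by a split:

$$\mathrm{Entropy}(D)=-\sum_i p_i\log_2 p_i,\qquad \mathrm{Gini}(D)=1-\sum_i p_i^2$$

$$\Delta\sigma=\sigma(D)-\sum_k\frac{|D_k|}{|D|}\,\sigma(D_k)\quad(\text{standard deviation reduction, regression tree})$$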

Code:

# This function calculates the entropy of the Decision column; the input data must contain that column
def calculateEntropy(df):
    # Entropy is not used by the regression tree, so return 0
    if algorithm == 'Regression':
        return 0

    rows = df.shape[0]
    # value_counts() counts each label value; keys() gets the label values and tolist() turns them into a list.
    # This line finds the distinct label values.
    decisions = df['Decision'].value_counts().keys().tolist()

    entropy = 0
    # Traverse all the labels
    for i in range(0, len(decisions)):
        # The number of times the label value appears
        num_of_decisions = df['Decision'].value_counts().tolist()[i]
        # Probability of occurrence
        class_probability = num_of_decisions / rows
        # Accumulate the entropy
        entropy = entropy - class_probability * math.log(class_probability, 2)

    return entropy
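A minimal usage sketch (hypothetical toy data; it assumes algorithm = "Classification"): two equally likely classes carry exactly one bit of entropy.

Code:

toy = pd.DataFrame({'Decision': ['Yes', 'Yes', 'No', 'No']})
print(calculateEntropy(toy))  # 1.0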

3. findDecision: finds the feature of the current data on which to split.

Code:

# The main purpose of this function is to traverse all the columns of the table,
# find the best split column, and return its name
def findDecision(ddf):
    # For a regression tree, take the standard deviation of the label values
    if algorithm == 'Regression':
        stdev = ddf['Decision'].std(ddof=0)
    # Get the entropy of the Decision column
    entropy = calculateEntropy(ddf)

    columns = ddf.shape[1]
    rows = ddf.shape[0]
    # Used to store the Gini and standard deviation reduction values
    ginis = []
    reducted_stdevs = []
    # Traverse all feature columns and calculate the relevant index according to the selected algorithm
    for i in range(0, columns - 1):
        column_name = ddf.columns[i]
        column_type = ddf[column_name].dtypes

        # If the column feature is numeric, process it with the following function,
        # which converts it into a string-type category on return.
        # The idea is to use text features directly and turn continuous numeric features
        # into discrete text features
        if column_type != 'object':
            ddf = processContinuousFeatures(ddf, column_name, entropy)
        # Count the categories in this column; after processing, continuous data falls
        # into the two categories "<= threshold" and "> threshold"
        classes = ddf[column_name].value_counts()
        gini = 0
        weighted_stdev = 0
        # Loop over the categories in the column
        for j in range(0, len(classes)):
            current_class = classes.keys().tolist()[j]
            # Select the rows whose value in this column equals the current category
            subdataset = ddf[ddf[column_name] == current_class]

            subset_instances = subdataset.shape[0]
            # For classification, accumulate the weighted Gini index
            if algorithm == 'Classification':  # GINI index
                decision_list = subdataset['Decision'].value_counts().tolist()

                subgini = 1

                for k in range(0, len(decision_list)):
                    subgini = subgini - math.pow((decision_list[k] / subset_instances), 2)

                gini = gini + (subset_instances / rows) * subgini
            # The regression tree is judged by the standard deviation;
            # accumulate the weighted standard deviation of the subclasses in this column
            elif algorithm == 'Regression':
                subset_stdev = subdataset['Decision'].std(ddof=0)
                weighted_stdev = weighted_stdev + (subset_instances / rows) * subset_stdev

        # Store the final value of this column
        if algorithm == "Classification":
            ginis.append(gini)
        # Store the decrease in standard deviation for this column
        elif algorithm == 'Regression':
            reducted_stdev = stdev - weighted_stdev
            reducted_stdevs.append(reducted_stdev)

    # Determine which column is the branching column:
    # for classification, the column with the smallest weighted Gini index;
    # for regression, the column with the largest standard deviation reduction
    if algorithm == "Classification":
        winner_index = ginis.index(min(ginis))
    elif algorithm == "Regression":
        winner_index = reducted_stdevs.index(max(reducted_stdevs))
    winner_name = ddf.columns[winner_index]

    return winner_name
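A minimal usage sketch (hypothetical toy data; it assumes algorithm = "Classification"): splitting on Outlook yields two pure subsets, so its weighted Gini index is 0 and it is chosen.

Code:

toy = pd.DataFrame({'Outlook': ['Sunny', 'Sunny', 'Rain', 'Rain'],
                    'Wind': ['Weak', 'Strong', 'Weak', 'Strong'],
                    'Decision': ['No', 'No', 'Yes', 'Yes']})
print(findDecision(toy))  # Outlook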

4. formatRule: standardizes the final output format.

Code:

# root is a number used to generate the indentation spaces that format the display of the decision process
def formatRule(root):
    resp = ''
    for i in range(0, root):
        resp = resp + '   '
    return resp

5. buildDecisionTree: the main function.

Code:

# This function builds the decision tree model;
# the inputs are the data in DataFrame format and the root indentation level.
# If the values in a column are text, the tree branches directly by text category
def buildDecisionTree(df, root):
    # Text labels are printed in quotes; numeric labels are not
    charForResp = "'"
    if algorithm == 'Regression':
        charForResp = ""

    tmp_root = root * 1

    df_copy = df.copy()
    # Find the winning column, i.e. the name of the best split column for the current data
    winner_name = findDecision(df)

    # Determine whether the winning column is numeric or text
    numericColumn = False
    if dataset_features[winner_name] != 'object':
        numericColumn = True

    # To keep the original data intact and prevent it from changing,
    # restore the columns other than the winning column,
    # so that they can still be branched on in the next step
    columns = df.shape[1]
    for i in range(0, columns - 1):
        column_name = df.columns[i]
        if df[column_name].dtype != 'object' and column_name != winner_name:
            df[column_name] = df_copy[column_name]

    # Find the categories in the branching column
    classes = df[winner_name].value_counts().keys().tolist()
    # Traversing all classes in the branch column serves two purposes:
    # 1. Display which class is currently being traversed; 2. Determine whether the class is already a leaf node
    for i in range(0, len(classes)):
        # Build the sub-dataset as in findDecision, but discard the current branching column
        current_class = classes[i]
        subdataset = df[df[winner_name] == current_class]
        # Drop the branching column and keep the remaining data
        subdataset = subdataset.drop(columns=[winner_name])
        # Edit the display text. For a numeric feature, the string conversion was already
        # done when searching for branches; for a text feature, display it with "=="
        if numericColumn == True:
            compareTo = current_class  # the current class is "<=x" or ">x" in this case
        else:
            compareTo = " == '" + str(current_class) + "'"

        terminateBuilding = False

        # -----------------------------------------------
        # Determine whether this is already a leaf node
        if len(subdataset['Decision'].value_counts().tolist()) == 1:
            final_decision = subdataset['Decision'].value_counts().keys().tolist()[
                0]  # all items are equal in this case
            terminateBuilding = True
        # Only the Decision column is left, that is, all the splitting features have been used
        elif subdataset.shape[1] == 1:
            # Take the most frequent label
            final_decision = subdataset['Decision'].value_counts().idxmax()
            terminateBuilding = True
        # The regression tree could also treat nodes with fewer than 5 samples as leaves:
        #elif algorithm == 'Regression' and subdataset.shape[0] < 5:  # pruning condition
        # Criterion used here: when the standard deviation in the node is small enough,
        # take the sample mean in the node as the value of the node
        elif algorithm == 'Regression' and subdataset['Decision'].std(ddof=0)/global_stdev < 0.4:
            # Take the average
            final_decision = subdataset['Decision'].mean()
            terminateBuilding = True
        # -----------------------------------------------
        # Output the branching result of the decision tree
        print(formatRule(root), "if ", winner_name, compareTo, ":")

        # -----------------------------------------------
        # Check whether a decision has been made
        if terminateBuilding == True:
            print(formatRule(root + 1), "return ", charForResp + str(final_decision) + charForResp)
        else:  # decision is not made, continue to create branches and leaves
            # root represents the size of the indent at display time
            root = root + 1
            # Recursively call this function on the sub-dataset
            buildDecisionTree(subdataset, root)

        root = tmp_root * 1

2.2.4 Execute the Code

Code:

# Call the function to build and print the tree
buildDecisionTree(df, root)

Output:

Figure 2-1 Regression tree result

 

Figure 2-2 Classification tree result