Huawei AI Certification Training
HCIA-AI
Machine Learning
Experiment Guide
ISSUE:3.0
HUAWEI TECHNOLOGIES CO., LTD.
Copyright © Huawei Technologies Co., Ltd. 2020. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.
Trademarks and Permissions: The Huawei logo and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.
Notice: The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.
Huawei Technologies Co., Ltd.
Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129, People's Republic of China
Website: http://e.huawei.com
Huawei Certification follows the "platform + ecosystem" development strategy, a new collaborative architecture of ICT infrastructure based on "Cloud-Pipe-Terminal". Huawei has set up a complete certification system consisting of three categories: ICT infrastructure certification, platform and service certification, and ICT vertical certification, making Huawei certification the only full-range technical certification in the industry.
Huawei offers three levels of certification: Huawei Certified ICT Associate (HCIA), Huawei Certified ICT Professional (HCIP), and Huawei Certified ICT Expert (HCIE).
HCIA-AI V3.0 aims to train and certify engineers who are capable of designing and developing AI products and solutions using algorithms such as machine learning and deep learning.
HCIA-AI V3.0 certification demonstrates that: you know the development history of AI, the Huawei Ascend AI system, and the full-stack, all-scenario AI strategy, and master traditional machine learning and deep learning algorithms; you can use the TensorFlow and MindSpore development frameworks to build, train, and deploy neural networks; and you are competent for sales, marketing, product manager, project management, and technical support positions in the AI field.
This document is intended for candidates preparing for the HCIA-AI exam and readers who want to understand the basics of AI programming. After working through this guide, you will be able to perform basic machine learning programming.
This guide contains experiments on using scikit-learn and other Python packages to predict house prices in Boston with different regression algorithms. It is hoped that trainees and readers can get started with machine learning and acquire basic machine learning programming capability.
To fully understand this course, readers should have basic Python programming capabilities and knowledge of data structures and machine learning algorithms.
The experiment environment is developed based on Python 3.6; XGBoost will also be used.
Background Knowledge Required
Experiment Environment Overview
1 Detail of Linear Regression
1.2.2 Define Related Functions
2.1.1 About This Experiment
2.2.1 Import the Modules You Need
2.2.2 Hyperparameter Definition Section
2.2.3 Define the Functions Required to Complete the Algorithm
This experiment uses basic Python code and the simplest possible data to reproduce, step by step, how a linear regression algorithm iterates and fits the existing data distribution.
The experiment mainly uses the NumPy module for calculation and the Matplotlib module for plotting.
The main purposes of this experiment are as follows:
Become familiar with basic Python statements
Master the implementation steps of linear regression
Ten data points are randomly set, and they follow an approximately linear relationship.
The data is converted to NumPy array format so that multiplication and addition can be computed directly on it.
Code:
#Import the required modules, numpy for calculation, and Matplotlib for drawing
import numpy as np
import matplotlib.pyplot as plt
#This line is for Jupyter Notebook only
%matplotlib inline
# define data, and change list to array
x = [3,21,22,34,54,34,55,67,89,99]
x = np.array(x)
y = [1,10,14,34,44,36,22,67,79,90]
y = np.array(y)
#Show the effect of a scatter plot
plt.scatter(x,y)
Output:
Scatter Plot
Model function: defines the linear regression model wx + b.
Loss function: the mean squared error (MSE) loss.
Optimization function: gradient descent, using the partial derivatives of the loss with respect to w and b.
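For reference, with n data points (x_i, y_i), the loss implemented below and the gradient-descent updates are (a plays the role of w in this two-dimensional case):

L(a,b) = \frac{1}{2n}\sum_{i=1}^{n}(a x_i + b - y_i)^2

\frac{\partial L}{\partial a} = \frac{1}{n}\sum_{i=1}^{n}(a x_i + b - y_i)\,x_i, \qquad \frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(a x_i + b - y_i)

a \leftarrow a - Lr\cdot\frac{\partial L}{\partial a}, \qquad b \leftarrow b - Lr\cdot\frac{\partial L}{\partial b}

where Lr is the learning rate set during initialization.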
Code:
#The basic linear regression model is wx + b; since this is a two-dimensional space, the model is ax + b
def model(a, b, x):
    return a * x + b

#The most commonly used loss function for a linear regression model is the mean squared error loss
def loss_function(a, b, x, y):
    num = len(x)
    prediction = model(a, b, x)
    return (0.5 / num) * (np.square(prediction - y)).sum()

#The optimization function uses partial derivatives to update the two parameters a and b
def optimize(a, b, x, y):
    num = len(x)
    prediction = model(a, b, x)
    #Update the values of a and b using the partial derivatives of the loss function with respect to a and b
    da = (1.0 / num) * ((prediction - y) * x).sum()
    db = (1.0 / num) * ((prediction - y).sum())
    #Lr is the global learning rate defined during initialization below
    a = a - Lr * da
    b = b - Lr * db
    return a, b

#Iteration function: performs the requested number of optimization steps and returns a and b
def iterate(a, b, x, y, times):
    for i in range(times):
        a, b = optimize(a, b, x, y)
    return a, b
Initialize the parameters and iteratively optimize the model
Code:
#Initialize parameters and display
a = np.random.rand(1)
print(a)
b = np.random.rand(1)
print(b)
Lr = 1e-4
#For the first iteration, the parameter values, losses, and visualization after the iteration are displayed
a,b = iterate(a,b,x,y,1)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Iterate 1 time
In the second iteration, the parameter values, loss value, and visualization after the iteration are displayed
Code:
a,b = iterate(a,b,x,y,2)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Iterate 2 times
In the third iteration, the parameter values, loss value, and visualization after the iteration are displayed
Code:
a,b = iterate(a,b,x,y,3)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Iterate 3 times
In the fourth iteration, the parameter values, loss value, and visualization after the iteration are displayed
Code:
a,b = iterate(a,b,x,y,4)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Iterate 4 times
In the fifth iteration, the parameter values, loss value, and visualization after the iteration are displayed
Code:
a,b = iterate(a,b,x,y,5)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Iterate 5 times
In the 10,000th iteration, the parameter values, loss value, and visualization after the iteration are displayed
Code:
a,b = iterate(a,b,x,y,10000)
prediction=model(a,b,x)
loss = loss_function(a, b, x, y)
print(a,b,loss)
plt.scatter(x,y)
plt.plot(x,prediction)
Output:
Iterate 10000 times
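As a quick cross-check (an optional sketch, not part of the original experiment), NumPy's polyfit function returns the closed-form least-squares line that gradient descent should be approaching:
Code:
#Closed-form least-squares fit for comparison; for degree 1, np.polyfit returns [slope, intercept]
a_opt, b_opt = np.polyfit(x, y, 1)
print(a_opt, b_opt)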
Try modifying the original data yourself. Think about it: does the loss value have to reach zero?
Modify the value of Lr. Think about it: what is the role of the Lr (learning rate) parameter? A minimal sketch for exploring this follows.
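The following sketch (assuming the functions and the x, y data defined above are still in scope) compares the final loss for several candidate learning rates; a rate that is too large can make the loss diverge:
Code:
#A minimal sketch: compare the final loss after the same number of updates for several learning rates
for lr_value in (1e-5, 1e-4, 1e-3):
    Lr = lr_value  # optimize() reads the global Lr
    a_test = np.random.rand(1)
    b_test = np.random.rand(1)
    a_test, b_test = iterate(a_test, b_test, x, y, 10000)
    # A learning rate that is too large may print inf or nan here
    print(lr_value, loss_function(a_test, b_test, x, y))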
This experiment walks through the decision tree algorithm using basic Python code.
It mainly uses the NumPy, Pandas, and Math modules. We will implement CART (Classification and Regression Tree) models in this experiment.
Before this experiment, download the dataset from the following link:
https://certification-data.obs.cn-north-4.myhuaweicloud.com/ENG/HCIA-AI/V3.0/ML-Dataset.rar
The purposes of this experiment are as follows:
Become familiar with basic Python syntax
Master the principle of the classification tree and implement it in Python code
Master the principle of the regression tree and implement it in Python code
Pandas is a tabular data processing module.
Math is mainly used for mathematical calculations.
NumPy is the basic computing module.
Code:
import pandas as pd
import math
import numpy as np
Here you can choose between the classification tree and the regression tree, specify the path of the dataset, save the feature types, and check whether the selected algorithm matches the dataset.
Code:
algorithm = "Regression" # Algorithm: Classification, Regression
algorithm = "Classification" # Algorithm: Classification, Regression
# Dataset1: Text features and text labels
#df = pd.read_csv("D:/Code/Decision Treee/candidate/decision-trees-for-ml-master/decision-trees-for-ml-master/dataset/golf.txt")
# Dataset2: Mix features and Numeric labels, here you have to change the path to yours.
df = pd.read_csv("ML-Dataset/golf4.txt")
# This dictionary is used to store feature types of continuous numeric features and discrete literal features for subsequent judgment
dataset_features = dict()
num_of_columns = df.shape[1]-1
#The data type of each column of the data is saved for displaying the data name
for i in range(0, num_of_columns):
#Gets the column name and holds the characteristics of a column of data by column
column_name = df.columns[i]
#Save the type of the data
dataset_features[column_name] = df[column_name].dtypes
# The size of the indent when display
root = 1
# If the algorithm selects a regression tree but the label is not a continuous value, an error is reported
if algorithm == 'Regression':
if df['Decision'].dtypes == 'object':
raise ValueError('dataset wrong')
# If the tag value is continuous, the regression tree must be used
if df['Decision'].dtypes != 'object':
algorithm = 'Regression'
global_stdev = df['Decision'].std(ddof=0)
processContinuousFeatures: converts a continuous numeric feature into a categorical feature.
Code:
# This function is used to handle numeric features
def processContinuousFeatures(cdf, column_name, entropy):
    # Arrange the numeric values in order
    unique_values = sorted(cdf[column_name].unique())
    subset_ginis = []
    subset_red_stdevs = []
    for i in range(0, len(unique_values) - 1):
        threshold = unique_values[i]
        # Find the segmentation result if this value is used as the threshold
        subset1 = cdf[cdf[column_name] <= threshold]
        subset2 = cdf[cdf[column_name] > threshold]
        # Calculate the proportion occupied by each of the two parts
        subset1_rows = subset1.shape[0]
        subset2_rows = subset2.shape[0]
        total_instances = cdf.shape[0]
        # For the classification tree, compute the weighted Gini index
        # of the two groups after segmentation
        if algorithm == 'Classification':
            decision_for_subset1 = subset1['Decision'].value_counts().tolist()
            decision_for_subset2 = subset2['Decision'].value_counts().tolist()
            gini_subset1 = 1
            gini_subset2 = 1
            for j in range(0, len(decision_for_subset1)):
                gini_subset1 = gini_subset1 - math.pow((decision_for_subset1[j] / subset1_rows), 2)
            for j in range(0, len(decision_for_subset2)):
                gini_subset2 = gini_subset2 - math.pow((decision_for_subset2[j] / subset2_rows), 2)
            gini = (subset1_rows / total_instances) * gini_subset1 + (subset2_rows / total_instances) * gini_subset2
            subset_ginis.append(gini)
        # For the regression tree, take the standard deviation as the judgment basis
        # and calculate the reduction in standard deviation for this threshold
        elif algorithm == 'Regression':
            superset_stdev = cdf['Decision'].std(ddof=0)
            subset1_stdev = subset1['Decision'].std(ddof=0)
            subset2_stdev = subset2['Decision'].std(ddof=0)
            threshold_weighted_stdev = (subset1_rows / total_instances) * subset1_stdev + (
                subset2_rows / total_instances) * subset2_stdev
            threshold_reducted_stdev = superset_stdev - threshold_weighted_stdev
            subset_red_stdevs.append(threshold_reducted_stdev)
    # Find the index of the best split value
    if algorithm == "Classification":
        winner_one = subset_ginis.index(min(subset_ginis))
    elif algorithm == "Regression":
        winner_one = subset_red_stdevs.index(max(subset_red_stdevs))
    # Find the corresponding threshold according to the index
    winner_threshold = unique_values[winner_one]
    # Convert the original numeric column to a categorical string column:
    # values no greater than the threshold become "<=threshold", the rest become ">threshold"
    cdf[column_name] = np.where(cdf[column_name] <= winner_threshold, "<=" + str(winner_threshold), ">" + str(winner_threshold))
    return cdf
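To see what the conversion does, here is a minimal sketch on a hypothetical toy DataFrame (the column values and labels are invented for illustration, and algorithm is assumed to be 'Classification' when it runs):
Code:
#Hypothetical toy data: one numeric feature plus the required 'Decision' column
toy = pd.DataFrame({'Humidity': [65, 70, 80, 90],
                    'Decision': ['Yes', 'Yes', 'No', 'No']})
toy = processContinuousFeatures(toy, 'Humidity', 0)  # the entropy argument is not used here
print(toy['Humidity'].tolist())  # expected: ['<=70', '<=70', '>70', '>70']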
calculateEntropy: calculates the entropy used as the splitting criterion for the classification tree; the regression tree uses variance-based criteria instead, so this function returns 0 for it.
Code:
# This function calculates the entropy of the Decision column; the input data must contain that column
def calculateEntropy(df):
    # The regression tree does not use entropy, so return 0
    if algorithm == 'Regression':
        return 0
    rows = df.shape[0]
    # value_counts() counts the occurrences of each label; keys() gets the label values and tolist() turns them into a list
    decisions = df['Decision'].value_counts().keys().tolist()
    entropy = 0
    # Loop over all the labels
    for i in range(0, len(decisions)):
        # Record the number of times the label value appears
        num_of_decisions = df['Decision'].value_counts().tolist()[i]
        # Probability of occurrence
        class_probability = num_of_decisions / rows
        # Calculate the entropy and sum it up
        entropy = entropy - class_probability * math.log(class_probability, 2)
    return entropy
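For example, for a 14-row dataset with nine 'Yes' labels and five 'No' labels (the classic golf dataset), the entropy is -(9/14)·log₂(9/14) - (5/14)·log₂(5/14) ≈ 0.940.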
findDecision: finds which feature of the current data is best to split on and returns its column name.
Code:
# The main purpose of this function is to traverse all the columns of the table,
# find the best split column, and return the name of that column
def findDecision(ddf):
    # For the regression tree, take the standard deviation of the label values
    if algorithm == 'Regression':
        stdev = ddf['Decision'].std(ddof=0)
    # Get the entropy of the Decision column (0 for the regression tree)
    entropy = calculateEntropy(ddf)
    columns = ddf.shape[1]
    rows = ddf.shape[0]
    # Used to store the Gini values and the standard deviation reductions
    ginis = []
    reducted_stdevs = []
    # Traverse all feature columns and calculate the relevant index for each, according to the selected algorithm
    for i in range(0, columns - 1):
        column_name = ddf.columns[i]
        column_type = ddf[column_name].dtypes
        # If the column feature is numeric, process it with processContinuousFeatures,
        # which converts the continuous numeric feature into a discrete categorical (string) feature,
        # so that all columns can then be handled as categorical
        if column_type != 'object':
            ddf = processContinuousFeatures(ddf, column_name, entropy)
        # Count the categories in this column; after processing, a continuous column has exactly two
        # categories: less than or equal to the threshold, and greater than the threshold
        classes = ddf[column_name].value_counts()
        gini = 0
        weighted_stdev = 0
        # Loop over the categories in the column
        for j in range(0, len(classes)):
            current_class = classes.keys().tolist()[j]
            # Select the rows whose value in this column equals the current category
            subdataset = ddf[ddf[column_name] == current_class]
            subset_instances = subdataset.shape[0]
            # For the classification tree, compute the weighted Gini index
            if algorithm == 'Classification':
                decision_list = subdataset['Decision'].value_counts().tolist()
                subgini = 1
                for k in range(0, len(decision_list)):
                    subgini = subgini - math.pow((decision_list[k] / subset_instances), 2)
                gini = gini + (subset_instances / rows) * subgini
            # The regression tree is judged by the standard deviation;
            # compute the weighted standard deviation of the subsets in this column
            elif algorithm == 'Regression':
                subset_stdev = subdataset['Decision'].std(ddof=0)
                weighted_stdev = weighted_stdev + (subset_instances / rows) * subset_stdev
        # Store the final value for this column
        if algorithm == "Classification":
            ginis.append(gini)
        # Store the decrease in standard deviation for this column
        elif algorithm == 'Regression':
            reducted_stdev = stdev - weighted_stdev
            reducted_stdevs.append(reducted_stdev)
    # Determine the best split column: the smallest Gini for classification,
    # the largest standard deviation reduction for regression
    if algorithm == "Classification":
        winner_index = ginis.index(min(ginis))
    elif algorithm == "Regression":
        winner_index = reducted_stdevs.index(max(reducted_stdevs))
    winner_name = ddf.columns[winner_index]
    return winner_name
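For instance, you can preview the winning column on a copy of the loaded data (a copy, because findDecision converts numeric columns in place):
Code:
#Preview the first split column without modifying df
winner = findDecision(df.copy())
print(winner)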
formatRule: standardizes the final output format by indenting rules according to tree depth.
Code:
# root is a number used to generate spaces (' ') that adjust the indentation of the displayed decision-making process
def formatRule(root):
    resp = ''
    for i in range(0, root):
        resp = resp + '   '
    return resp
buildDecisionTree: the main function, which recursively builds and prints the decision tree.
Code:
# With this function, you build the decision tree model,
# entering data in DataFrame format and the root (indentation depth) value
# If the values in a column are literal, it branches directly by literal category
def buildDecisionTree(df, root):
    # Identify the quote character for the output: quotes for text labels, nothing for numbers
    charForResp = "'"
    if algorithm == 'Regression':
        charForResp = ""
    tmp_root = root * 1
    df_copy = df.copy()
    # Find the winning column of the decision tree: input a DataFrame,
    # output the column name of the best split column
    winner_name = findDecision(df)
    # Determine whether the winning column is numeric or categorical
    numericColumn = False
    if dataset_features[winner_name] != 'object':
        numericColumn = True
    # To keep the original data intact and prevent it from changing,
    # restore the data of all columns except the winning column,
    # so that branching can continue in the next step
    columns = df.shape[1]
    for i in range(0, columns - 1):
        column_name = df.columns[i]
        if df[column_name].dtype != 'object' and column_name != winner_name:
            df[column_name] = df_copy[column_name]
    # Find the categories in the branching column
    classes = df[winner_name].value_counts().keys().tolist()
    # Traversing all categories in the branch column has two functions:
    # 1. display which category is currently traversed; 2. determine whether the category is already a leaf node
    for i in range(0, len(classes)):
        # Build the sub-dataset as in findDecision, but discard the column of the current branch
        current_class = classes[i]
        subdataset = df[df[winner_name] == current_class]
        subdataset = subdataset.drop(columns=[winner_name])
        # Edit the display. For a numeric feature, the character conversion was already done when searching for branches;
        # for a categorical feature, display the category in quotes
        if numericColumn == True:
            compareTo = current_class  # current_class is "<=x" or ">x" in this case
        else:
            compareTo = " == '" + str(current_class) + "'"
        terminateBuilding = False
        # -----------------------------------------------
        # Determine whether this is already the last leaf node
        if len(subdataset['Decision'].value_counts().tolist()) == 1:
            final_decision = subdataset['Decision'].value_counts().keys().tolist()[0]  # all items are equal in this case
            terminateBuilding = True
        # Only the Decision column is left, that is, all the segmentation features have been used
        elif subdataset.shape[1] == 1:
            # Take the most frequent label
            final_decision = subdataset['Decision'].value_counts().idxmax()
            terminateBuilding = True
        # The regression tree could also treat a node with fewer than 5 elements as a leaf (pruning condition):
        # elif algorithm == 'Regression' and subdataset.shape[0] < 5:
        # Here the criterion is the standard deviation instead, and the sample mean in the node is used as the node value
        elif algorithm == 'Regression' and subdataset['Decision'].std(ddof=0) / global_stdev < 0.4:
            # Take the average as the leaf value
            final_decision = subdataset['Decision'].mean()
            terminateBuilding = True
        # -----------------------------------------------
        # Output the branching result of the decision tree
        print(formatRule(root), "if ", winner_name, compareTo, ":")
        # -----------------------------------------------
        # Check whether a decision has been made
        if terminateBuilding == True:
            print(formatRule(root + 1), "return ", charForResp + str(final_decision) + charForResp)
        else:  # the decision is not made; continue to create branches and leaves
            # root represents the size of the indent when displaying
            root = root + 1
            # Call this function recursively for the sub-dataset
            buildDecisionTree(subdataset, root)
        root = tmp_root * 1
Code:
# call the function
buildDecisionTree(df, root)
Output:
Regression tree result
CART tree result