Report on Machine Learning for Autonomous Driving

Day 1

The first day introduced the basic concepts of autonomous driving and the layout of the specialization.

First of all, the driving task, which consists of three main sub-tasks:

  1. Perceiving the environment
  2. Planning how to get from point A to point B
  3. Controlling the vehicle

Another important term is the ODD (Operational Design Domain): the operating conditions under which a given driving automation system or its functions are specifically designed to operate.

When we classify the automation of a driving system, there are three points to consider:

  1. Driver attention requirements
  2. Driver action requirements
  3. What exactly makes up a driving task
    1. lateral control like steering
    2. longitudinal control like braking and accelerating
    3. OEDR, namely object and event detection and response
    4. planning
      1. long term (route)
      2. short term
    5. miscellaneous

We therefore evaluate the degree of driving automation by assessing which of these tasks the system can perform:

  • L0, namely no automation
  • L1, driver assistance
  • L2, partial driving automation, which is able to do both the lateral and longitudinal control
  • L3, conditional driving automation
  • L4, high driving automation, which can do the lateral and longitudinal control, OEDR and fallback.
  • L5, full driving automation, which adds an unlimited ODD on top of L4.

Perception for a vehicle involves two things: identification and understanding motion.

Day 2

Autonomous driving cannot do without sensors and computing hardware.


Def A sensor is a device that measures or detects a property of the environment, or changes to a property.

Sensors by categories:

  • exteroceptive: extero means surroundings
  • proprioceptive: proprio means internal

Camera: essential for correctly perceiving the environment. The three key characteristics of a camera are resolution, FOV (field of view) and dynamic range.

In practice, resolution and FOV have to be traded off against each other.

A stereo camera can estimate depth from image data, just like human eyes.

Lidar: can generate 3D scene geometry from the LIDAR point cloud. The four key characteristics of LIDAR are number of beams, points per second, rotation rate and FOV.

Radar: robust object detection and relative speed estimation. The three key characteristics of RADAR are range, FOV, and position and speed accuracy.

Here too, range and FOV have to be traded off against each other.

Ultrasonics: low-cost, short-range, all-weather distance measurement. The three elements of ultrasonics are range, FOV and cost.

GNSS / IMU: namely global navigation satellite systems and inertial measurement units.

Wheel odometry: obtains data directly from the wheels.

Computing Hardware

The brain of the self-driving car. It:

  • takes in all sensor data
  • computes actions
  • does all the processing
  • synchronizes the different modules and provides a common clock

Autonomous cars are still dangerous, yet the development of autonomous driving is the general trend.

Day 3

We talked about vehicle dynamic modeling.

In general, it is often sufficient to look only at kinematic models of vehicles. Dynamic modeling is more involved, but it captures vehicle behavior more precisely over a wide operating range.

Because each part of the system works in its own coordinate frame, we need coordinate transformations, which involve linear transformations.

Once given the transformation matrix, coordinates in one coordinate system can be easily transformed into another coordinate system.
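As a minimal sketch of such a transformation, assume a 2D case with a homogeneous 3 × 3 matrix. The frame names, the pose values and the helper make_transform are hypothetical and only serve the illustration.

```python
import numpy as np

def make_transform(theta, tx, ty):
    """Homogeneous 2D transform: rotate by theta (radians), then translate by (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# Hypothetical pose of the vehicle (body) frame expressed in the world frame:
# rotated by 90 degrees and located at (2, 1).
T_world_body = make_transform(np.pi / 2, 2.0, 1.0)

# A point 1 m ahead of the vehicle, in body coordinates (homogeneous form).
p_body = np.array([1.0, 0.0, 1.0])

# Body -> world is a single matrix product.
p_world = T_world_body @ p_body

# World -> body uses the inverse of the same matrix.
p_back = np.linalg.inv(T_world_body) @ p_world

print(p_world[:2], p_back[:2])
```

Chaining several frames (for example sensor to body to world) is then just a product of the corresponding matrices.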

Combining the sensor data with classical physics also gives us enough information to describe the vehicle's motion.

The next module mainly covers control systems. With these methods we can implement longitudinal control and lateral control.

Day 4

This part is mainly about the control systems of autonomous driving.

Least Squares

Least squares is an important mathematical method for estimating parameter values from data.

Estimating a resistance is the main example in this course; it shows how linear least squares is applied in matrix form.

Given measurements of the voltage drop across the resistor at various current values, we determine the parameter R in y=Rx, where x is the current and y is the voltage.

We assume that each measurement is equal to the noise-free value plus measurement noise, that is, y=Rx+v. (In the simpler case where the resistance itself is measured directly, the model is y=x+v, in which x is the actual resistance value.)

Stacking all measurements gives \textbf{y}=HR+\textbf{v}, where H is the column of current values. Minimizing the sum of squared errors \textbf{e}^T\textbf{e}=(\textbf{y}-HR)^T(\textbf{y}-HR) gives \hat{R}=(H^TH)^{-1}H^T\textbf{y}. Of course, bypassing linear algebra and setting the derivative to zero by hand is also feasible, but it is more troublesome in comparison.

In this part, our measurement model is linear in the unknown parameter, and measurements are equally weighted, namely we do not suspect that some have more noise than others.
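A minimal sketch of the matrix formula applied to the voltage and current example. The data values are made up for illustration and are not the course data; H is assumed to be the column of current values.

```python
import numpy as np

# Hypothetical measurements: current in amperes, voltage in volts.
current = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
voltage = np.array([1.23, 1.38, 2.06, 2.47, 3.17])

H = current.reshape(-1, 1)   # single column, since there is one unknown (R)
y = voltage.reshape(-1, 1)

# Normal equations: R_hat = (H^T H)^{-1} H^T y
R_hat = (np.linalg.inv(H.T @ H) @ H.T @ y).item()

# Cross-check with NumPy's built-in least squares solver.
R_check = np.linalg.lstsq(H, y, rcond=None)[0].item()

print(R_hat, R_check)
```

For larger problems, np.linalg.lstsq is usually preferred over forming the inverse explicitly, since it is numerically more robust.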

Weighted Least Squares

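The formula is \hat{\textbf{x}}=(H^TR^{-1}H)^{-1}H^TR^{-1}\textbf{y}, where R now denotes the covariance matrix of the measurement noise (not the resistance). The derivation is similar to the above. The model is \textbf{y}=H\textbf{x}+\textbf{v}, but each squared error is now weighted by the inverse of its noise variance.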
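A small sketch of the weighted formula, assuming four direct readings of one resistance taken with two hypothetical multimeters of different quality. All numbers are invented for illustration; in the code R is the noise covariance matrix.

```python
import numpy as np

# Hypothetical readings in ohms: the first two come from a poor meter,
# the last two from a good one.
y = np.array([1068.0, 988.0, 1002.0, 996.0]).reshape(-1, 1)
H = np.ones((4, 1))   # every reading measures the unknown x directly

# Assumed noise standard deviations: 20 ohms (poor meter), 2 ohms (good meter).
R = np.diag([20.0**2, 20.0**2, 2.0**2, 2.0**2])
R_inv = np.linalg.inv(R)

# Weighted least squares: x_hat = (H^T R^-1 H)^-1 H^T R^-1 y
x_wls = (np.linalg.inv(H.T @ R_inv @ H) @ H.T @ R_inv @ y).item()

# Ordinary least squares for comparison (here simply the plain average).
x_ols = float(y.mean())

print(x_wls, x_ols)
```

The weighted estimate stays close to the readings of the better meter, while the plain average is pulled toward the outlying reading of the poorer one.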

Recursive Least Squares

In the least squares method, all the data is collected first and then solved in one batch. In practice, however, measurement data often arrives online, one measurement after another. For example, an unmanned vehicle needs to judge the motion state of the vehicle in front in real time and obtain its acceleration, so it has to measure at every moment. Every time new measurement data is obtained, the least squares estimate has to be updated.

If every update reapplied the batch formula to all the data, the amount of computation would become very large as the number of measurements grows, and all measurement data would have to be stored. Instead, we can use a recursive method that revises the previous result using only the new measurement, which greatly reduces the amount of computation and storage.

The formula is \hat{x}_k=\hat{x}_{k-1}+K_k(y_k-H_k\hat{x}_{k-1}).

The steps of the algorithm are:

  1. Initialize the estimator.
  2. Set up the measurement model, defining the Jacobian and the measurement covariance matrix.
  3. For each new measurement, update the estimate and its covariance.
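The three steps above can be sketched as follows. The gain and covariance updates use the standard recursive least squares form, which these notes do not spell out; the readings, the prior and the noise variance are hypothetical.

```python
import numpy as np

def rls_update(x_hat, P, y_k, H_k, R_k):
    """One recursive least squares step: returns the new estimate and covariance."""
    # Gain: how much to trust the new measurement relative to the current estimate.
    K = P @ H_k.T @ np.linalg.inv(H_k @ P @ H_k.T + R_k)
    # Correct the estimate with the measurement residual.
    x_hat = x_hat + K @ (y_k - H_k @ x_hat)
    # Shrink the covariance accordingly.
    P = (np.eye(P.shape[0]) - K @ H_k) @ P
    return x_hat, P

# Step 1: initialize the estimator with a very uncertain prior.
x_hat = np.array([[0.0]])
P = np.array([[1e6]])

# Step 2: measurement model (direct measurement, assumed noise variance 4).
H_k = np.array([[1.0]])
R_k = np.array([[4.0]])

# Step 3: update with each measurement as it arrives.
for reading in [1002.0, 996.0, 1004.0, 998.0]:
    x_hat, P = rls_update(x_hat, P, np.array([[reading]]), H_k, R_k)

print(x_hat.item(), P.item())
```

Only the previous estimate and its covariance are kept between updates, so no past measurements need to be stored.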

Maximum Likelihood Estimation

An important assumption for sampling in maximum likelihood estimation is that all samples are independent and identically distributed.

Most of this part is covered in detail in probability theory.
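As a small illustration of why the independence assumption matters, the sketch below assumes Gaussian noise with a known standard deviation and made-up samples. Because the samples are independent, the log-likelihood is simply a sum over samples, and its maximizer coincides with the sample mean.

```python
import numpy as np

samples = np.array([4.8, 5.1, 5.3, 4.9, 5.4])   # hypothetical i.i.d. measurements
sigma = 0.2                                      # assumed known noise std

def log_likelihood(mu):
    """Sum of Gaussian log-densities; the sum is valid because samples are i.i.d."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (samples - mu) ** 2 / (2 * sigma**2))

# Evaluate the log-likelihood on a grid of candidate means (step 0.001).
candidates = np.linspace(4.0, 6.0, 2001)
mu_mle = candidates[np.argmax([log_likelihood(m) for m in candidates])]

print(mu_mle, samples.mean())
```

Under these assumptions, maximizing the likelihood is equivalent to minimizing the sum of squared errors, which links this section back to least squares.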

Day 5

Kalman Filter

We got into state estimation – linear and nonlinear Kalman filters.

The Kalman filter is an efficient recursive filter (autoregressive filter) that can estimate the state of a dynamic system from a series of incomplete and noisy measurements.

In short, the Kalman Filter is similar to RLS but includes a motion model that tells us how the state evolves over time.

The Kalman Filter updates a state estimate through 2 stages:

  1. prediction using the motion model
  2. correction using the measurement model
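The two stages can be sketched for a linear system as follows. This is a minimal example under stated assumptions: a constant-velocity motion model with no control input, a position-only measurement, and invented noise values.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Prediction: push the state and its covariance through the motion model."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def kf_correct(x, P, y, H, R):
    """Correction: blend the prediction with a measurement via the Kalman gain."""
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (y - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

dt = 0.5
F = np.array([[1.0, dt], [0.0, 1.0]])   # state = [position, velocity]
Q = 0.01 * np.eye(2)                    # assumed process noise covariance
H = np.array([[1.0, 0.0]])              # only the position is measured
R = np.array([[0.05]])                  # assumed measurement noise variance

x = np.array([[0.0], [5.0]])            # initial state estimate
P = np.diag([0.01, 1.0])                # initial covariance

x_pred, P_pred = kf_predict(x, P, F, Q)
x_new, P_new = kf_correct(x_pred, P_pred, np.array([[2.2]]), H, R)

print(x_pred.ravel(), x_new.ravel())
```

The correction step has the same structure as the recursive least squares update; the prediction step is what the motion model adds.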


BLUE means Best Linear Unbiased Estimator.

In general, if we have white, uncorrelated zero-mean noise, the Kalman Filter is the best (i.e., lowest variance) unbiased estimator that uses only a linear combination of measurements, which is exactly what makes it the BLUE.


EKF stands for Extended Kalman Filter, a Kalman filter for nonlinear systems.

The main idea of the EKF is to linearize the nonlinear function with a first-order Taylor expansion about the current estimate, which amounts to replacing the function by its tangent at that point.

It relies on computing Jacobian matrices, which contain all the first-order partial derivatives of a function.

For multivariate functions, we take the partial derivative with respect to each variable separately and collect the results in the Jacobian.
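A short sketch of this linearization for one possible measurement function. The range and bearing model, the operating point and the helper names are assumptions chosen for illustration and are not taken from the slides.

```python
import numpy as np

def h(x):
    """Hypothetical nonlinear measurement: range and bearing to the origin."""
    px, py = x
    return np.array([np.hypot(px, py), np.arctan2(py, px)])

def jacobian_h(x):
    """All first-order partial derivatives of h, evaluated at x."""
    px, py = x
    r2 = px**2 + py**2
    r = np.sqrt(r2)
    return np.array([[px / r,   py / r],
                     [-py / r2, px / r2]])

x_op = np.array([3.0, 4.0])   # operating point (current estimate)
H = jacobian_h(x_op)

# First-order Taylor expansion around the operating point.
dx = np.array([0.05, -0.02])
linear = h(x_op) + H @ dx
exact = h(x_op + dx)

print(H)
print(linear - exact)   # linearization error, small close to x_op
```

The EKF then uses H in place of the constant measurement matrix of the linear Kalman filter.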

Improved Nonlinear Kalman Filter

Here is just a summary of the slides:

ES-EKF or Error-State Extended Kalman Filter.

The ES-EKF estimates the error state directly and uses it as a correction to the nominal state.

The algorithm loops over the following steps:

  1. Update nominal state with motion model
  2. Propagate uncertainty
  3. If a measurement is available:
    1. Compute Kalman Gain
    2. Compute error state
    3. Correct nominal state
    4. Correct state covariance

The ES-EKF performs better than the vanilla EKF because it separates the state into a "large" nominal state and a "small" error state, and the small error state behaves much more linearly. It is also easy to use with constrained quantities (e.g., rotations in 3D).

Limitations of Nonlinear Kalman Filter

Since we use Taylor expansion, linearization error depends on:

  1. How nonlinear the function is
  2. How far away from the operating point the linear approximation is being used

The EKF is prone to linearization error when:

  1. The system dynamics are highly nonlinear
  2. The sensor sampling time is slow relative to how fast the system is evolving.

Linearization error therefore has two important consequences:

  1. The estimated mean state can become very different from the true state
  2. The estimated state covariance can fail to capture the true uncertainty in the state.

Day 6

We need to process the data gained from LiDAR sensing, of course.

There are two LIDAR sensor models, the inverse sensor model and the forward sensor model, which map between the sensor's measurements and Cartesian coordinates in opposite directions.
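A minimal sketch of the two models, assuming the LIDAR reports range, azimuth and elevation and that the elevation is measured up from the horizontal plane. This angle convention is an assumption; real sensors may define the angles differently.

```python
import numpy as np

def inverse_sensor_model(r, azimuth, elevation):
    """Spherical measurement (range, azimuth, elevation) -> Cartesian point."""
    return np.array([r * np.cos(azimuth) * np.cos(elevation),
                     r * np.sin(azimuth) * np.cos(elevation),
                     r * np.sin(elevation)])

def forward_sensor_model(p):
    """Cartesian point in the sensor frame -> (range, azimuth, elevation)."""
    x, y, z = p
    r = np.sqrt(x**2 + y**2 + z**2)
    return np.array([r, np.arctan2(y, x), np.arcsin(z / r)])

# Hypothetical return: 10 m away, 30 degrees to the left, 15 degrees up.
point = inverse_sensor_model(10.0, np.pi / 6, np.pi / 12)
measurement = forward_sensor_model(point)

print(point, measurement)
```

Applying the inverse model to every return of a scan yields the point cloud in the sensor frame.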

At this point, we can combine all the sensor data mentioned so far to build a state estimator for an autonomous vehicle.

Day 7

The CV (Computer Vision) part begins with visual perception for self-driving cars. We can say that CV starts with the projection from the real world onto the image. We use the same linear transformation operations as before to complete the mapping.
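The projection can be sketched with a pinhole camera model. The intrinsic matrix K, the camera pose and the 3D point below are hypothetical values, and lens distortion is ignored.

```python
import numpy as np

def project(K, R, t, p_world):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    p_cam = R @ p_world + t      # world frame -> camera frame
    uvw = K @ p_cam              # camera frame -> homogeneous image coordinates
    return uvw[:2] / uvw[2]      # divide by depth to get pixels

# Assumed intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed extrinsics: camera frame coincides with the world frame.
R = np.eye(3)
t = np.zeros(3)

pixel = project(K, R, t, np.array([1.0, 0.5, 5.0]))
print(pixel)
```

Everything up to the final division by depth is a matrix product; the division is the only nonlinear step.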

First we went over several basic concepts of images, then we got into an important part: image filtering.

For example, if we want to remove noise from an image, we can use a mean filter or a Gaussian filter. Usually, the Gaussian filter performs better.
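To make the two filters concrete, here is a small NumPy-only sketch. The helper cross_correlate, the 3 × 3 kernels and the toy image are assumptions for illustration; in practice a library routine would be used instead of the explicit loops.

```python
import numpy as np

def cross_correlate(image, kernel):
    """'Valid' cross-correlation: slide the kernel over the image, no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Mean filter: all neighbors weighted equally.
mean_kernel = np.ones((3, 3)) / 9.0
# Gaussian-like filter: the center pixel gets the largest weight.
gaussian_kernel = np.array([[1.0, 2.0, 1.0],
                            [2.0, 4.0, 2.0],
                            [1.0, 2.0, 1.0]]) / 16.0

# Toy image: flat background with a single deviating pixel.
image = np.full((5, 5), 10.0)
image[2, 2] = 100.0

mean_out = cross_correlate(image, mean_kernel)
gauss_out = cross_correlate(image, gaussian_kernel)
print(mean_out[1, 1], gauss_out[1, 1])
```

Both kernels sum to 1, so flat regions keep their intensity; they differ only in how strongly nearby pixels are weighted relative to distant ones.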


Def A convolution is a cross-correlation where the filter is flipped both horizontally and vertically before being applied to the image.

Unlike cross-correlation, convolution is associative.

One application of cross-correlation is template matching. The pixel with the highest response from cross-correlation is the location of the template in an image.

One application of convolution is gradient computation. You define a finite difference kernel, and apply it to the image to get the image gradient.
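The relation between the two operations and the gradient computation can be checked on a single image row treated as a 1D signal. The intensity values are made up; for a 2D image the kernel would be flipped along both axes.

```python
import numpy as np

# Hypothetical intensities along one image row (they follow i**2).
row = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
kernel = np.array([1.0, 0.0, -1.0])   # finite difference kernel

conv = np.convolve(row, kernel, mode="valid")
corr = np.correlate(row, kernel, mode="valid")

# Convolution equals cross-correlation with the flipped kernel.
corr_flipped = np.correlate(row, kernel[::-1], mode="valid")

# Convolving with [1, 0, -1] gives row[i + 1] - row[i - 1], a central
# difference; dividing by 2 pixels turns it into a gradient estimate.
gradient = conv / 2.0

print(conv, corr, corr_flipped, gradient)
```

For this antisymmetric kernel, convolution and cross-correlation differ only in sign; for a symmetric kernel such as a Gaussian, the two give the same result.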

Detection, Description and Matching

Def Features are points of interest in an image.

The characteristics of points of interest are:

  • Saliency: distinctive, identifiable and different from its immediate neighborhood
  • Repeatability: can be found in multiple images using the same operations
  • Locality: occupies a relatively small subset of image space
  • Quantity: enough points represented in the image
  • Efficiency: reasonable computation time

There are several feature detection algorithms like Harris Corners and Harris-Laplace.

A descriptor is an N-dimensional vector that provides a summary of the image information around the detected feature.

Scale-invariant feature transform (SIFT) is a machine vision algorithm used to detect and describe local features in images. It finds extrema in scale space and extracts their position, scale and orientation, which makes the features invariant to scale and rotation. This algorithm was published by David Lowe in 1999 and summarized in 2004.

SIFT features are based on the local appearance at points of interest on an object and are independent of image size and rotation. On modern computers, recognition can run at close to real-time speed.

There are other descriptors as well.

Def Feature Matching: Given a feature and its descriptor in one image, find the best match in another image.

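The simplest algorithm is Brute Force Feature Matching, which compares every feature in one image against every feature in the other. With n features per image, the time complexity is \Theta(n^2).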

In this part, we define different types of distance functions between descriptors. For example, we have the sum of squared differences (SSD), the sum of absolute differences (SAD) and the Hamming distance.

We can improve the accuracy by defining a distance threshold \delta and rejecting any match whose distance exceeds it.
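A minimal sketch of brute force matching with SSD and a threshold. The descriptors are tiny made-up vectors and the function name brute_force_match is hypothetical; real descriptors such as SIFT have many more dimensions.

```python
import numpy as np

def brute_force_match(desc1, desc2, delta):
    """For each descriptor in desc1, find its nearest neighbor in desc2 by SSD.

    Matches whose SSD exceeds the threshold delta are discarded.
    Returns a list of (index_in_desc1, index_in_desc2, ssd) tuples.
    """
    matches = []
    for i, d in enumerate(desc1):
        ssd = np.sum((desc2 - d) ** 2, axis=1)   # distance to every candidate
        j = int(np.argmin(ssd))
        if ssd[j] <= delta:
            matches.append((i, j, float(ssd[j])))
    return matches

# Hypothetical 3-dimensional descriptors from two images.
desc1 = np.array([[1.0, 0.0, 2.0],
                  [5.0, 5.0, 5.0],
                  [9.0, 1.0, 0.0]])
desc2 = np.array([[5.1, 4.9, 5.0],
                  [1.0, 0.1, 2.0],
                  [0.0, 0.0, 0.0]])

matches = brute_force_match(desc1, desc2, delta=0.5)
print(matches)
```

The third descriptor of the first image has no close counterpart, so the threshold removes it instead of forcing a wrong match.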

A K-D tree can be used to speed up matching. In OpenCV it is available directly through cv2.FlannBasedMatcher().

Day 8

Def A Feedforward Neural Network (FNN) defines a mapping from input x to output y as y=f(x;\theta).

An N-layer FNN is represented as a composition of functions: f(x;\theta)=f^{(N)}(f^{(N-1)}(\cdots f^{(1)}(x))).

x is called the input layer, the last function f^{(N)} is called the output layer, and the functions in between are called the hidden layers.
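The composition can be sketched as a forward pass. The ReLU activation, the linear output layer and the weight values are assumptions for illustration; the notes do not fix a particular architecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Apply each (W, b) layer in turn, i.e. evaluate the function composition."""
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)   # hidden layers: affine map followed by ReLU
    W, b = layers[-1]
    return W @ h + b          # output layer: affine map only

# Hypothetical parameters theta for a network with 2 inputs,
# one hidden layer of 3 units and 1 output.
layers = [
    (np.array([[1.0, -1.0],
               [0.5,  0.5],
               [-1.0, 2.0]]), np.array([0.0, 0.0, 1.0])),
    (np.array([[1.0, 2.0, -1.0]]), np.array([0.5])),
]

y = forward(np.array([2.0, 1.0]), layers)
print(y)
```

Training then consists of adjusting the entries of the (W, b) pairs so that the output approaches the target values.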

Def Training data are pairs of an input x and the corresponding example value of the target function f^*(x).

Day 9

Batch Gradient Descent

BGD is a type of gradient descent that works on the entire data set: it computes the gradient direction from all samples at every step.

BGD is relatively slow when the sample set is large, but for a convex loss it converges to the global optimum.

Stochastic Gradient Descent

The stochastic gradient descent algorithm can be seen as a special case of mini-batch gradient descent in which the model parameters are adjusted according to only one sample at a time.

SGD is fast per update, at the price of noisier, less accurate gradient estimates.

Note that:

  • GPUs work better with power-of-2 batch sizes
  • Always make sure the dataset is shuffled before sampling a minibatch
  • As a rule of thumb, a large batch size is > 256, while a small batch size is < 64
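The notes above can be sketched as a mini-batch training loop. The toy regression problem, the learning rate and the batch size are hypothetical; setting batch_size to the data set size gives BGD, and setting it to 1 gives SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from y = 3x plus a little noise.
X = rng.uniform(-1.0, 1.0, size=256)
Y = 3.0 * X + 0.01 * rng.normal(size=256)

w = 0.0            # single parameter to learn
lr = 0.1           # learning rate (assumed)
batch_size = 32    # power of 2

for epoch in range(50):
    # Shuffle before sampling minibatches.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # Gradient of the mean squared error on this minibatch only.
        grad = np.mean(2.0 * (w * X[idx] - Y[idx]) * X[idx])
        w -= lr * grad

print(w)
```

Each update touches only batch_size samples, so its cost does not grow with the size of the data set.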

Extra Day 1

Let's take it easy and check the No Free Lunch Theorem.

The No Free Lunch Theorem is often thrown around in the field of optimization and machine learning, often with little understanding of what it means or implies.

The theorem states that all optimization algorithms perform equally well when their performance is averaged across all possible problems. It implies that there is no single best optimization algorithm. Because of the close relationship between optimization, search, and machine learning, it also implies that there is no single best machine learning algorithm for predictive modeling problems such as classification and regression.

Data Splits

Based on experience, 70% training, 20% validation, 10% testing is recommended.
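A minimal sketch of such a split on sample indices. The function name split_indices and the fixed seed are illustrative choices, not a prescribed procedure.

```python
import numpy as np

def split_indices(n, train=0.7, val=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into train / validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    # Whatever remains after train and validation becomes the test set.
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Shuffling before splitting avoids accidentally putting, say, all night-time images into the test set when the data was recorded in order.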

Reducing the effect of underfitting / overfitting

  • Underfitting (training loss is high)
    • Train longer
    • More layers or more parameters per layer
    • Change architecture
  • Overfitting (generalization gap is large)
    • More training data
    • Regularization
    • Change architecture

About hyperparameters:

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are derived via training.

Learning from Data by Yaser Abu-Mostafa of Caltech is recommended.

Extra Day 2

In the near future, we can focus on clustering and unsupervised learning.

A brief introduction to the U-Net model:

U-Net is a convolutional neural network that was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg. The network is based on the fully convolutional network and its architecture was modified and extended to work with fewer training images and to yield more precise segmentations. Segmentation of a 512 × 512 image takes less than a second on a modern GPU.

When debugging the Python code, it is really important to resolve the dependencies and choose the right version of each dependency library.