This is an online AI System Course to help students learn the whole stack of systems that support AI, and practice them in the real projects. In this course, we will use terms AI-System and System for AI alternately.
This course is one of the AI-related course in 微软人工智能教育与共建社区. Under the A-基础教程 module. The course numbe an name are A6-人工智能系统.
Welcome to A-基础教程 module to access more related content.
It is strongly recommended that learners who want to learn or consolidate the core knowledge of artificial intelligence, first learn A2-神经网络基本原理简明教程,also known as the 9-step learn Neural Network。It will bring great help to the study of this course.
In recent years, the rapid development of artificial intelligence, especially deep learning technology, is inseparable from the continuous progress of hardware and software systems. In the foreseeable future, the development of artificial intelligence technology will still rely on a joint innovation model that combines computer systems and artificial intelligence. Computer systems are now empowering artificial intelligence with a larger scale and higher complexity. This requires not only more system innovation, but also systematic thinking and methodology. At the same time, artificial intelligence in turn provides support for the design of complex systems.
We have noticed that most of the current artificial intelligence-related courses, especially deep learning and machine learning related courses, mainly focus on related theories, algorithms or applications, but system-related courses are rare. We hope that the course of artificial intelligence systems can make artificial intelligence related education more comprehensive and in-depth, so as to jointly promote the cultivation of talents that intersect artificial intelligence and systems.
This course aims to help students:
-
Completely understand the computer system architecture that supports deep learning, and learn the system design under the full life cycle of deep learning through practical problems.
-
Introduce cutting-edge systems and artificial intelligence research work, including AI for Systems and Systems for AI, to help senior undergraduates and graduate students better find and define meaningful research questions.
-
Design experimental courses from the perspective of system research. Encourage students to implement and optimize system modules by operating and applying mainstream and latest frameworks, platforms and tools to improve their ability to solve practical problems, not just understanding the use of tools.
Prerequisites: C/C++/Python, Computer Architecture, Introduction to algorithms
The course mainly includes the following three modules:
The first part is the basic knowledge of artificial intelligence and a full-stack overview of artificial intelligence systems; and the systematic design and methodology of deep learning systems.
The second part of the advanced courses includes the most cutting-edge systems and artificial intelligence research fields.
The third part is the supporting experimental courses, including the most mainstream frameworks, platforms and tools, and a series of experimental projects.
The content of the first part will focus on basic knowledge, while the content of the other two parts will be dynamically adjusted with the technological progress of academia and industry. The content of the latter two parts will be organized in a modular form to facilitate adjustment or combination with other CS courses (such as compilation principles, etc.) as advanced lectures or internship projects.
The design of this course will also draw on the research results and experience of Microsoft Research Asia in the intersection of artificial intelligence and systems, including some platforms and tools developed by Microsoft and the research institute. The course also encourages other schools and teachers to add and adjust more advanced topics or other experiments according to their needs.
Lectures have two parts--basic courses and advanced courses. The first part is focus on basic theories, from lesson 1 to 6, while the second part involves more cutting-edge research, from lesson 7 to 14.
Basic Courses
Course No. | Lecture Name | Remarks |
1 | Introduction | Overview and system/AI basics |
2 | System perspective of Systems for AI | Systems for AI: a historic view; Fundamentals of neural networks; Fundamentals of Systems for AI |
3 | Computation frameworks for DNN | Backprop and AD, Tensor, DAG, Execution graph. Papers and systems: PyTorch, TensorFlow |
4 | Computer architecture for Matrix computation | Matrix computation, CPU/SIMD, GPGPU, ASIC/TPU Papers and systems: Blas, TPU |
5 | Distributed training algorithms | Data parallelism, model parallelism, distributed SGD Papers and systems: PipeDream |
6 | Distributed training systems | MPI, parameter servers, all-reduce, RDMA Papers and systems: Horovod |
7 | Scheduling and resource management system | Running dnn job on cluster: container, resource allocation, scheduling Papers and systems: Kubeflow, OpenPAI,Gandiva, HiveD |
8 | Inference systems | Efficiency, latency, throughput, and deployment Papers and systems: TensorRT, TensorflowLite, ONNX |
Advanced Courses
Course No. | Course Name | Remarks |
9 | Computation graph compilation and optimization | IR, sub-graph pattern match, Matrix multiplication and memory optimization Papers and systems: XLA, MLIR, TVM, NNFusion |
10 | Efficiency via compression and sparsity | Model compression, Sparsity, Pruning |
11 | AutoML systems | Hyper parameter tuning, NAS Papers and systems: Hyperband, SMAC, ENAS, AutoKeras, NNI |
12 | Reinforcement learning systems | Theory of RL, systems for RL Papers and systems: AC3, RLlib, AlphaZero |
13 | Security and Privacy | Federated learning, security, privacy Papers and systems: DeepFake |
14 | AI for systems | AI for traditional systems problems, for system algorithms Papers and systems: Learned Indexes, Learned query path |
Labs also have two parts: The first part is configured to make sure students can run most of Labs at local machine. The advanced part may need a small cluster (local or on Cloud) with GPU support.
Basic Labs
Lab No. |
Lab Name | Remarks |
Lab 1 | A simple end-to-end AI example, from a system perspective |
Understand the systems from debug info and system logs |
Lab 2 | Customize operators | Design and implement a customized operator (both forward and backward) in python |
Lab 3 | CUDA implementation | Add a CUDA implementation for the customized operator |
Lab 4 | AllReduce implementation | Improve AllReduce on Horovod: implement a lossy compression (3LC) on GPU for low-bandwidth network |
Lab 5 | Configure containers for customized training and inference | Configure containers |
Advanced Labs
Lab No. |
Lab Name | Remarks |
Lab 6 | Scheduling and resource management system | Get familiar with OpenPAI or KubeFlow |
Lab 7 | Distributed training | Try different kinds of all reduce implementations |
Lab 8 | AutoML | Search for a new neural network structure for Image/NLP tasks |
Lab 9 | RL Systems | Configure and get familiar with one of the following RL Systems: RLlib, … |
The following lists the relevant courses in the direction of artificial intelligence systems in other schools and institutions.
<TBD>
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.
Privacy information can be found at https://privacy.microsoft.com/en-us/
Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.