At CDW’s BTEX 2021 virtual event, Michael Traves, Principal Field Solutions Architect at CDW Canada, presented on the uses of artificial intelligence and machine learning in a business environment. Here are some of the highlights.
What is artificial intelligence, and how is it used in business?
“Artificial intelligence is broadly defined as any simulation of human intelligence,” says Traves. “Machine learning is typically what we find customers engaged in. It’s an approach to artificial intelligence that involves building models and training computational runs, inferencing, but involves a fair amount of data. Depending on how much data, that’s where the concept of deep learning comes in.”
When it comes to deep learning, Traves describes it as more computationally intensive, going beyond what a CPU would provide. This is where customers would require graphics processing units (GPUs) or specialized hardware to meet their needs.
Defining data science
“Data science is the scientific methods, algorithms and systems to extract insights and knowledge from Big Data,” says Traves. “You’re dealing with very large data sets, machine learning, deep learning tasks and statistical methods.”
“Data science encompasses AI, machine learning and deep learning. But deep learning is a subset of machine learning, machine learning is a subset of AI, and there’s a number of areas of research” including fraud detection, sentiment analysis and speech-to-text.
Foundational technology of artificial intelligence
Here are some of the key hardware components that power AI technology:
Graphics processing unit (GPU): A processor specially designed for the types of calculations needed in computer graphics; efficient for highly parallel problems in the deep learning and machine learning space.
Field programmable gate array (FPGA): This general-purpose device can be reprogrammed at the logic gate level.
Application-specific integrated circuit (ASIC): Designed to be very effective for one application only.
Tensor Processing Unit (TPU): Google’s custom ASIC architecture for accelerating machine learning workloads.
NVIDIA BlueField Data Processing Unit (DPU): Combines a ConnectX network adapter with an array of Arm cores, offering purpose-built hardware acceleration engines with full data-centre-infrastructure-on-a-chip programmability.
Machine learning from development to production
“We try to help customers on this journey from development to production,” says Traves. “Many may be starting with a workstation that you’re putting a GPU in. That may be a workstation on your desk, or one you’ve purchased specifically for this purpose and are sharing it within a workgroup. But eventually, as you start moving from that experimentation phase into doing models and iterations on training runs, you’re going to want to move something into the data centre, probably with a different class of GPU.”
“That’s where we get into training larger systems and networking those large systems together so they can run jobs in parallel. This means multiple computational loads, multiple GPUs and a shared data repository they’re going to be accessing to run workloads in parallel with each other.”
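The idea of several workers reading from one shared data repository and running in parallel can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for GPUs, the tiny `SHARED_DATA` list stands in for the shared repository, and every name here (`shard`, `partial_gradient`, `parallel_step`, `train`) is hypothetical rather than anything from the presentation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shared data repository: (x, y) pairs that follow y = 3x.
SHARED_DATA = [(float(x), 3.0 * x) for x in range(1, 9)]


def shard(data, num_workers):
    """Split the shared data set into one slice per worker."""
    return [data[i::num_workers] for i in range(num_workers)]


def partial_gradient(weight, rows):
    """Gradient of the mean squared error of y ~ weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in rows) / len(rows)


def parallel_step(weight, shards, learning_rate=0.01):
    """One data-parallel step: every worker computes a gradient on its own
    shard, then the gradients are averaged and applied once."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        grads = list(pool.map(lambda rows: partial_gradient(weight, rows), shards))
    return weight - learning_rate * sum(grads) / len(grads)


def train(num_workers=4, steps=200):
    """Run repeated data-parallel steps starting from weight 0."""
    shards = shard(SHARED_DATA, num_workers)
    weight = 0.0
    for _ in range(steps):
        weight = parallel_step(weight, shards)
    return weight
```

Calling `train()` should bring the weight very close to 3.0. On real hardware the averaging step is where the fast interconnect between systems matters, since every worker has to exchange its gradient on each step.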
“Once we’ve done that model training at scale, we’ll want to do inferencing, which is taking results of a model that have been successful and pushing that out to your end-users to consume. There’s not as much data there, it’s on-demand, client-facing, maybe on a webpage that users are accessing, or maybe on your cell phone.”
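To make the split between training and inferencing concrete, here is a minimal sketch, assuming a toy linear model and a JSON string as the "pushed out" artifact. The function names (`train_model`, `export_model`, `predict`) and the sample data are assumptions for illustration only.

```python
import json


def train_model(samples, steps=2000, learning_rate=0.05):
    """Training: the data-heavy part. Fits y ~ w * x + b by gradient descent."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return {"w": w, "b": b}


def export_model(params):
    """Package the trained parameters so they can be shipped to a client-facing service."""
    return json.dumps(params)


def predict(serialized_model, x):
    """Inferencing: on demand, one input at a time, no training data required."""
    params = json.loads(serialized_model)
    return params["w"] * x + params["b"]


# Hypothetical training data following y = 2x + 1.
samples = [(float(x), 2.0 * x + 1.0) for x in range(5)]
model = export_model(train_model(samples))
```

After this runs, `predict(model, 10.0)` needs only the two exported numbers, which is why inferencing can live on a webpage or a phone while training stays in the data centre.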
What makes up an artificial intelligence platform
When it comes to building out an AI platform, there are a few things to keep in mind. You’ll want to consider workload types, orchestration, platform types, data sources, portability and scale.
Workload Types
- Interactive (the user is creating something with a playbook), ML pipeline (submit the job with the data and it runs against the cluster) or inferencing (client-facing)
- Development, training and production environments
Orchestration
- Scheduling workloads to run on a particular node vs. orchestration, i.e. having them run wherever resources happen to be available
- How to provide auditability so that users are getting access to the equipment in a fair way
Platform Types
- Development workstation (at home or in the office), or at the edge on a mobile device
- On-premises cluster (more cost-effective for training), public cloud (tools are provided for you, but it is more costly than maintaining your own environment) or managed platform (someone else runs it for you, providing the tools as a service)
Portability and Scale
- Containers can be deployed anywhere at scale (on-prem, on your desktop, in the cloud) while running the same code
- Able to leverage serverless functions
Data Sources
- Data gravity – where is the bulk of the data that you’re using? If there’s a lot of data, computation should be run close to the data
- How am I going to get that data to the public cloud, if that’s where the tools are?
- Data engineering – how to bring data into an environment to do training, e.g. ETL (extract, transform, load: getting data into a format that’s usable), batch updates of the data and stream-processing updates of the data
- Every time you update data, you have to decide if you want to rerun your model, and validate that before you push into production
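The last point, deciding whether to rerun a model after a data update and validating it before production, can be sketched as two small checks. This is one possible approach, not a prescribed one: the data fingerprint, the score comparison and all function names are assumptions made for the example.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable hash of a data set, used to detect that the data has changed."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_retraining(previous_fingerprint, rows):
    """A rerun is worth considering only when the data differs from the last run."""
    return fingerprint(rows) != previous_fingerprint


def promote_if_valid(candidate_score, production_score, tolerance=0.0):
    """Validation gate: push to production only if the retrained model
    scores at least as well as the current one (within a tolerance)."""
    return candidate_score >= production_score - tolerance


# Hypothetical batch update: one new record arrives.
old_rows = [{"id": 1, "amount": 10}]
new_rows = old_rows + [{"id": 2, "amount": 5}]
last_trained_on = fingerprint(old_rows)
```

Here `needs_retraining(last_trained_on, new_rows)` is true because a record was added, and the retrained model would then have to pass `promote_if_valid` before replacing the production version.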
Kubernetes vs. Slurm: job scheduling or orchestration?
“Kubernetes is very good for inferencing,” says Traves. “Slurm is more about scheduling, and there’s stronger auditability around who gets to use what. A research or use case cluster might use Slurm, whereas a production-facing cluster might use Kubernetes.” Traves provides a breakdown of the two platforms, as follows:
Pros and Cons of Kubernetes
- Orchestrates scheduling, management and health of containers (e.g. Docker)
- Excellent for web services (e.g. Inference Server)
- Not made with AI training in mind
- Should extend with a platform for AI training (e.g. Kubeflow)
- More difficult to configure user access, permissions and security
Benefits of Slurm for job training
- Schedules jobs to run on a subset of cluster resources
- Excellent for AI training
- Meant for highly performant work, e.g. multinode jobs leveraging InfiniBand networking
- Closely tied to *nix systems, easy to integrate with existing auth and security mechanisms
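The difference between scheduling a job onto specific resources and orchestrating it wherever capacity is free can be shown with a toy model. This is a conceptual sketch only and does not reflect how Slurm or Kubernetes work internally; the `Node` and `Cluster` classes, the node names and the audit log format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_gpus: int


@dataclass
class Cluster:
    nodes: list
    audit_log: list = field(default_factory=list)

    def schedule_on(self, user, node_name, gpus):
        """Scheduler style: the job targets a specific node and is queued if it is busy."""
        node = next(n for n in self.nodes if n.name == node_name)
        if node.free_gpus < gpus:
            self.audit_log.append((user, node_name, gpus, "queued"))
            return None
        node.free_gpus -= gpus
        self.audit_log.append((user, node_name, gpus, "started"))
        return node.name

    def place_anywhere(self, user, gpus):
        """Orchestrator style: run on the first node that has enough free resources."""
        for node in self.nodes:
            if node.free_gpus >= gpus:
                node.free_gpus -= gpus
                self.audit_log.append((user, node.name, gpus, "started"))
                return node.name
        self.audit_log.append((user, None, gpus, "pending"))
        return None


cluster = Cluster([Node("gpu-a", 2), Node("gpu-b", 4)])
first = cluster.schedule_on("alice", "gpu-a", 2)   # takes both GPUs on gpu-a
second = cluster.schedule_on("bob", "gpu-a", 1)    # gpu-a is full, so the job is queued
third = cluster.place_anywhere("bob", 1)           # lands wherever capacity exists
```

The audit log records who asked for what and what happened, which is the kind of accountability the research-cluster use case calls for.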
What is MLOps?
“MLOps is a practice between data scientists, DevOps and machine learning engineers,” says Traves. “It’s designed to increase automation, and that’s what CI/CD does in the DevOps space. It’s all about automating, infrastructure as code, being able to push code through a pipeline and when your code changes, you push the updates through.”
“It automatically gets tested at each step, pushed to your test, dev and production environments when it makes sense, and you’re leveraging that software development lifecycle, continuous integration/continuous delivery, orchestration, monitoring what’s happening at each step in this process, providing feedback to your developers. It’s all the same concepts as what you deal with in a DevOps environment, with the machine learning in the front, which is really the model.”
“You want to use significantly more data, and the parameters associated with it, and that’s what you end up pushing through and integrating on.”
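As a rough sketch of that flow, the snippet below promotes a model artifact through the test, dev and production environments and stops at the first failed check. The checks, the accuracy thresholds and the artifact fields are hypothetical, chosen only to illustrate automated testing at each step.

```python
def run_pipeline(artifact, checks, environments=("test", "dev", "production")):
    """Promote an artifact one environment at a time, stopping at the first failure."""
    history = []
    for env in environments:
        passed = all(check(artifact, env) for check in checks)
        history.append((env, "passed" if passed else "failed"))
        if not passed:
            break
    return history


def has_model(artifact, env):
    """The artifact must actually contain a model."""
    return "model" in artifact


def accuracy_gate(artifact, env):
    """Hypothetical thresholds: production demands a higher score than earlier stages."""
    threshold = 0.9 if env == "production" else 0.8
    return artifact.get("accuracy", 0.0) >= threshold


checks = [has_model, accuracy_gate]
weak = run_pipeline({"model": "v2", "accuracy": 0.85}, checks)
strong = run_pipeline({"model": "v3", "accuracy": 0.95}, checks)
```

The `history` list is the feedback loop: it tells the developer exactly which stage rejected the change. In an ML setting the artifact would bundle the model together with the data version and parameters it was trained on.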
How MLOps makes life easier for data scientists
“As a data scientist, you shouldn’t have to learn about InfiniBand, hundred-gig networking or all-flash storage,” says Traves. “As opposed to becoming a platform engineer around machine learning, what you really want to do is run your models, consume services. You want things to be easy, predictable, repeatable. You want it to be automated and when you change your model or data, you want to be able to iterate on that, without really changing anything else.”
“MLOps is about building that workflow to allow for experimentation, for you to iterate and retrain as necessary. It’s providing that framework for data scientists and analysts to submit jobs, and to package all of that in a uniform container that you can now take and run wherever data happens to be accessible, and have access to the right resources to run so that things are packaged and available.”
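One way to picture a uniformly packaged job is a small, immutable job description that reads the same wherever it is submitted. This is a sketch under assumptions: the `TrainingJob` fields, the image name and the data location below are made up for the example and do not correspond to any particular platform’s job format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingJob:
    image: str       # container image holding the code and its dependencies
    command: tuple   # entry point to run inside the container
    data_uri: str    # where the training data is accessible
    gpus: int = 1    # resources the job needs


def render_job(job):
    """Serialize the job so a workstation, cluster or cloud backend gets the same spec."""
    return json.dumps(asdict(job), sort_keys=True)


# Hypothetical job: every value here is a placeholder.
job = TrainingJob(
    image="registry.example/fraud-model:1.0",
    command=("python", "train.py"),
    data_uri="s3://example-bucket/train",
    gpus=2,
)
spec = render_job(job)
```

Because the description is frozen and serialized deterministically, changing the model or the data means submitting a new job rather than editing the platform underneath it.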
“As a data scientist, you don’t want to be an ML platform engineer. So the platform should be supported by Operations or a managed platform that lives on-prem or in the cloud. Operations may have the foundational knowledge, but they’re really pulling from software engineering and DevOps methodologies.”