AWS Machine Learning Specialty Notes

Jason Y. Liu
20 min read · Aug 3, 2021

Foreword

After the exam, I found this resource, which matched the questions I had with startling accuracy. Hopefully it will be helpful for you!

https://vceguide.com/amazon/aws-certified-machine-learning-specialty-mls-c01/

General concepts

Machine Learning = supervised learning + unsupervised learning + reinforcement learning

A model is an extremely generic program, made specific by the data used to train it.

Model training algorithms work through an iterative process: the current model iteration is analyzed to determine what changes can be made to get closer to the goal, those changes are made, and the iterations continue until the model is evaluated to meet the goals.

Model inference is when the trained model is used to generate predictions.

A categorical label has a discrete set of possible values, such as “is a cat” and “is not a cat.”

A continuous (regression) label does not have a discrete set of possible values, which means there is potentially an unlimited number of possibilities.

Discrete: A term taken from statistics referring to an outcome taking on only a finite number of values (such as days of the week).

A label refers to data that already contains the solution.

Using unlabeled data means you don’t need to provide the model with any kind of label or solution while the model is being trained.

Outliers are data points that are significantly different from others in the same sample.

Hyperparameters are settings on the model which are not changed during training but can affect how quickly or how reliably the model trains, such as the number of clusters the model should identify.

A loss function is used to codify the model’s distance from its goal.

Training dataset: The data on which the model will be trained. Most of your data will be here.

Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.

Model parameters are settings or configurations the training algorithm can update to change how the model behaves.

Bag of words: A technique used to extract features from text. It counts how many times each word appears in a document within a corpus, and then transforms that information into a dataset.
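As a rough illustration (not tied to any AWS service), here is a minimal bag-of-words sketch using scikit-learn’s CountVectorizer; the example sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)     # one row per document, one column per word

print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(counts.toarray())                     # word counts per document
```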

Data vectorization: A process that converts non-numeric data into a numerical format so that it can be used by a machine learning model. E.g. one-hot encoding

Stop words: A list of words removed by natural language processing tools when building your dataset. There is no single universal list of stop words used by all natural language processing tools.

Epoch: refers to a single cycle of your training job through the full training dataset

Regularizations: techniques used to reduce the error by fitting a function appropriately on the given training set and avoid overfitting

Techniques that can be either supervised or unsupervised:

· Image classification is the most common application of computer vision in use today. Image classification can be used to answer questions like What’s in this image? This type of task has applications in text detection or optical character recognition (OCR) and content moderation.

· Object detection is closely related to image classification, but it allows users to gather more granular detail about an image. For example, rather than just knowing whether an object is present in an image, a user might want to know if there are multiple instances of the same object present in an image, or if objects from different classes appear in the same image. (e.g. To detect both the cat and the dog present in this image)

· Activity recognition is an application of computer vision that is based around videos rather than just images. Video has the added dimension of time and, therefore, models are able to detect changes that occur over time.

Learning rate

· The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.
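A minimal sketch of the effect, using plain gradient descent on a toy quadratic (the function and values are illustrative only):

```python
def gradient_descent(lr, steps=20):
    # Minimize f(w) = (w - 3)^2 starting from w = 0; the gradient is 2 * (w - 3).
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

print(gradient_descent(lr=0.1))   # small learning rate: converges smoothly toward 3
print(gradient_descent(lr=1.1))   # too large: the updates overshoot and diverge
```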

Model fitting

· Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).

· An overfitting model’s fitted curve would look squiggly. This is because it has learned too much information (even noise) from the training data.

Overfitting?

· The dropout hyperparameter refers to the dropout probability for network layers. Dropout is a form of regularization used in neural networks that reduces overfitting by trimming codependent neurons (see the sketch after this list).

· Increasing the value of the L1 regularization hyperparameter can also reduce overfitting, but L1 regularization is not available for Amazon SageMaker Object2Vec; it is used for simpler models such as the Linear Learner.
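A minimal Keras sketch of dropout as a regularizer (layer sizes and dropout rates are arbitrary, chosen only for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of activations during training only
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```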

Data augmentation

· a technique to artificially create new training data from existing training data. This is done by applying domain-specific techniques to examples from the training data that create new and different training examples.

· Image data augmentation is perhaps the most well-known type of data augmentation and involves creating transformed versions of images in the training dataset that belong to the same class as the original image.
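A minimal sketch of image augmentation with Keras’ ImageDataGenerator (the transform settings are arbitrary examples):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # random rotations of up to 20 degrees
    width_shift_range=0.1,    # small random horizontal shifts
    height_shift_range=0.1,   # small random vertical shifts
    horizontal_flip=True,     # mirror the image; the class label stays the same
    zoom_range=0.1,
)

# train_images: array of shape (n, height, width, channels); train_labels aligned with it.
# augmented = datagen.flow(train_images, train_labels, batch_size=32)
# model.fit(augmented, epochs=10)
```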

Impute/Imputation

· is the process of replacing missing data with substituted values.

· The Multiple Imputations by Chained Equations (MICE) algorithm is a robust, informative method of dealing with missing data in your datasets. This procedure imputes or ‘fills in’ the missing data in a dataset through an iterative series of predictive models. Each specified variable in the dataset is imputed in each iteration using the other variables in the dataset. These iterations are run until convergence has been met. You can manage the missing values in your target and related datasets by leveraging the automated imputation support in Amazon Forecast.
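scikit-learn’s IterativeImputer follows the same chained-equations idea (it is inspired by MICE rather than an exact implementation); a minimal sketch with made-up values:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

X = np.array([
    [25.0, 50_000.0],
    [32.0, np.nan],       # missing income
    [np.nan, 61_000.0],   # missing age
    [41.0, 72_000.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)   # each variable is modeled from the others, iteratively
```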

Synthetic Minority Oversampling Technique (SMOTE)

· an oversampling approach in which the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with replacement

· The minority class is oversampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen.
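A minimal sketch with the imbalanced-learn library (the dataset is synthetic, with class weights chosen only to create an imbalance):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))   # heavily imbalanced classes

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))   # minority class oversampled with synthetic examples
```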

Text cleaning/text preprocessing

· an integral stage of an NLP pipeline. Since most NLP applications deal with unstructured texts (e.g., reviews, news articles, social media posts), you have to give “structure” to them so they can be conveniently processed and understood by your model.

· Some examples of preprocessing are Changing letters into lowercase, Word Tokenization, Stop word removal, HTML tag removal, Stemming, Lemmatization, and so on.

· Changing letters into lowercase will bring uniformity to all available data.

· Stop word removal is the process of removing words that do not hold any significant meaning to a sentence. These are words that are commonly used and repeated (e.g., the, is, are). Removing them will reduce the overall size of your data which will eventually help speed up model training.

· Word Tokenization is the process of splitting a sentence into individual words that can be used as input for the Word2vec model to create embeddings.
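A minimal NLTK sketch covering lowercasing, word tokenization, stop word removal, and stemming (the sample sentence is made up):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download("punkt"); nltk.download("stopwords")

text = "The quick brown foxes are jumping over the lazy dogs!"

tokens = word_tokenize(text.lower())                                  # lowercase + word tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # stop word removal
stems = [PorterStemmer().stem(t) for t in tokens]                     # stemming
print(stems)   # e.g. ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
```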

Term Frequency — Inverse Document Frequency (TfIdf)

· an algorithm used to convert text data into a numerical representation that can be passed into a machine learning model. The first function (Term Frequency) counts how frequently a word appears in a sentence belonging to a corpus. The second function (Inverse Document Frequency) measures how rare a word is across the whole corpus, so words that appear in many documents receive lower weights.

· The Tf-Idf is a great way of giving weights to words as it penalizes generic words that commonly appear across all sentences.

· For example, suppose a document collection has 4 sentences. The word “blue” appears only in the first sentence (3 times), while the word “horizon” appears once in each of 3 sentences. Both words appear 3 times across the whole collection, but “blue” gets more weight than “horizon”. Even if a word has a low frequency across all documents, the Tf-Idf vectorizer is still able to capture how special that word is by giving it a higher score.
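A minimal sketch of that scenario with scikit-learn’s TfidfVectorizer (the sentences below are invented to match the description):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "blue blue blue sky",      # "blue" appears 3 times, but only in this sentence
    "the horizon at dawn",     # "horizon" appears once in each of 3 sentences
    "ships on the horizon",
    "beyond the horizon",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)
# "blue" ends up with a higher weight in sentence 1 than "horizon" does elsewhere,
# because its inverse document frequency is larger.
```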

Model evaluation

· to train a classification model, we can use the Area Under the ROC Curve (AUC) metric to evaluate the model performance. ROC is Receiver Operating Characteristic. Then, use a grid scatter plot to show the correlation between each AUC metric-hyperparameter combination. With this information at hand, you can retune the hyperparameters to attain the model characteristic that you want.

· to evaluate the performance of regression models, you can use a scatter plot to visualize the results for each root mean square error (RMSE)-hyperparameter combination

· Mean Absolute Percentage Error (MAPE) is used to evaluate regression models. It measures the average percentage error between paired observations expressing the same phenomenon.

· Residual plots are a visualization technique that measures how much a regression line vertically misses a data point. They identify whether the model is underestimating or overestimating the target.
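A minimal sketch of these metrics with scikit-learn (all numbers are made up; newer scikit-learn versions also expose root_mean_squared_error directly):

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    roc_auc_score,
)

# Regression metrics
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
rmse = mean_squared_error(y_true, y_pred, squared=False)       # root mean square error
mape = mean_absolute_percentage_error(y_true, y_pred)          # mean absolute percentage error

# Classification metric
y_cls = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_cls, scores)                             # area under the ROC curve
```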

Confusion matrix:

Model Accuracy is the fraction of predictions a model gets right.

Recall is how many of the true positives were recalled (found), among all the real positives.

F1 Score

· is a combination of the Precision and Recall metrics: it is the harmonic mean of the two. When Precision and Recall are first computed by an averaging method across classes, the result is also known as the Macro F1 score. The F1 score measures how accurate the classifier results are for the test data; the best score is 1 and the worst is 0.
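A minimal sketch of the confusion-matrix metrics with scikit-learn (the labels are made up):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = recall_score(y_true, y_pred)         # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                 # 2 * precision * recall / (precision + recall)
```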

Data Transformations

- Normalization Transformation: Consider a dataset with a feature called ‘age’ that ranges between 18 and 35 and a ‘product price’ that ranges between $50 and $5,000. Since the product price has significantly larger values than the age, the model will treat the product price with “more importance”, which hurts the model’s ability to classify data correctly. Feature scaling of numeric features such as loan amount and age is an example (see the sketch after this list).

- Quantile Binning Transformation is a process used to discover non-linearity in the variable’s distribution by grouping observed values together. It won’t help normalize features with wide-ranging differences.

- N-gram Transformation is used to split text into lists of contiguous word sequences (of length n).

- Orthogonal Sparse Bigram (OSB) Transformation: like the N-gram Transformation, this transformation is intended to aid in text string analysis and is an alternative to the bi-gram transformation.
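A minimal sketch of feature scaling and quantile binning with pandas and scikit-learn (column names and values are invented):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [18, 24, 29, 35],
    "product_price": [50.0, 400.0, 1_200.0, 5_000.0],
})

# Normalization: rescale both features to [0, 1] so price does not dominate by magnitude alone.
df[["age", "product_price"]] = MinMaxScaler().fit_transform(df[["age", "product_price"]])

# Quantile binning: group a numeric feature into equal-frequency bins.
df["price_bin"] = pd.qcut(df["product_price"], q=2, labels=["low", "high"])
```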

One hot encoding

· a process by which categorical variables are converted into numerical values that could be provided to ML algorithms to do a better job in prediction. It creates a number (n) of numerical columns, where n is the number of unique values in a selected categorical variable.

· For example, consider a column named shirt_size. Shirts are available in small, medium, large, or extra-large. In this scenario, there are four distinct values for shirt_size. Therefore, ONE_HOT_ENCODING generates four new columns. Each new column is named shirt_size_x, where x represents a distinct shirt_size value.
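A minimal sketch with pandas.get_dummies, mirroring the shirt_size example:

```python
import pandas as pd

df = pd.DataFrame({"shirt_size": ["small", "medium", "large", "extra-large", "medium"]})

encoded = pd.get_dummies(df, columns=["shirt_size"], prefix="shirt_size")
print(encoded.columns.tolist())
# ['shirt_size_extra-large', 'shirt_size_large', 'shirt_size_medium', 'shirt_size_small']
```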

Unsupervised learning

Clustering. Unsupervised learning task that helps to determine if there are any naturally occurring groupings in the data.

The k-means algorithm

· attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups (see the following figure). You define the attributes that you want the algorithm to use to determine similarity. Another way you can define k-means is that it is a clustering problem that finds k cluster centroids for a given set of records, such that all points within a cluster are closer in distance to their centroid than they are to any other centroid.

· High-dimensional data such as census records often contain similar or redundant information which makes it difficult to visualize or derive actionable conclusions from. You can reduce the number of features in your dataset through PCA by combining similar or redundant features together to form a new, smaller feature set. In doing so, the resulting data would be easier to interpret, and your algorithm will also gain performance improvement.

· The reduced dataset will be used by the k-means algorithm which will create a grouping of similar counties in a transformed feature space
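A minimal sketch of the PCA-then-k-means pattern with scikit-learn (the data, component count, and cluster count are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 30)   # stand-in for high-dimensional census-style records

X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=5).fit_transform(X_scaled)   # combine redundant features

labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X_reduced)
```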

Silhouette coefficient

· a method of interpretation and validation of consistency within clusters of data. A score from -1 to 1 describing the clusters found during modeling. A score near zero indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters. A score approaching 1 indicates successful identification of discrete non-overlapping clusters.
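A minimal sketch with scikit-learn’s silhouette_score (synthetic, well-separated blobs, so the score should be close to 1):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

print(silhouette_score(X, labels))   # near 1: discrete, non-overlapping clusters
```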

Neural Networks

· Input Layer: The first layer in a neural network. This layer receives all data that passes through the neural network.

· Hidden Layer: A layer that occurs between the output and input layers. Hidden layers are tailored to a specific task.

· Output Layer: The last layer in a neural network. This layer is where the predictions are generated based on the information captured in the hidden layers.

Semantic segmentation

· another common application of computer vision that takes a pixel-by-pixel approach. Instead of just identifying whether an object is present or not, it tries to identify, down to the pixel level, which parts of the image belong to the object.

Generative AI

· Generator: A neural network that learns to create new data resembling the source data on which it was trained.

· Discriminator: A neural network trained to differentiate between real and synthetic data.

· Generator loss: Measures how far the output data deviates from the real data present in the training dataset.

· Discriminator loss: Evaluates how well the discriminator differentiates between real and fake data.

Collaborative filtering

· based on (user, item, rating) tuples. So, unlike content-based filtering, it leverages other users’ experiences. The main concept behind collaborative filtering is that users with similar tastes (based on observed user-item interactions) are more likely to have similar interactions with items they haven’t seen before.

· Compared to content-based filtering, collaborative filtering provides better results for diversity (how dissimilar recommended items are); serendipity (a measure of how surprising the successful or relevant recommendations are); and novelty (how unknown recommended items are to a user). However, collaborative filtering is more computationally expensive and more complex and costly to implement and manage. Though some algorithms used for collaborative filtering such as factorization machines are more lightweight than others. Collaborative filtering has a cold start problem as well, since it has difficulty recommending new items without a large amount of interaction data to train a model.

Principal Component Analysis (PCA)

· a popular unsupervised learning technique used by data scientists primarily for dimensionality reduction in numerous applications ranging from stock market prediction to medical image classification

t-SNE

· t-distributed Stochastic Neighbor Embedding (t-SNE)

· a non-linear dimensionality reduction algorithm used for exploring high-dimensional data

· preprocess a dataset that contains highly correlated variables.

Random Cut Forest (RCF)

· For anomaly detection

Softmax activation function

· a mathematical equation that converts a vector of real numbers into a vector of a probability distribution whose sum is equal to 1.
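A minimal NumPy sketch of the function:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())   # e.g. [0.659 0.242 0.099] and a sum of 1.0
```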

Latent Dirichlet Allocation (LDA) algorithm

· an unsupervised algorithm mainly used to discover a user-specified number of topics shared by documents within a text corpus. For example, you may uncover new topics/categories from a document by determining the occurrence of each word.

Supervised learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples

Support Vector Machines (SVM) with RBF kernel

· a supervised algorithm mainly used for classification tasks. It uses decision boundaries to separate groups of data.

· The SVM with Radial Basis Function (RBF) kernel is a variation of the (linear) SVM used to separate non-linear data. Separating randomly distributed data in a two-dimensional space can be a daunting and difficult task. The RBF kernel provides an efficient way of mapping data (e.g., 2-D) into a higher dimension (e.g., 3-D). In doing so, we can conveniently apply the decision surface/hyperplane on which the model’s predictions are based.

Factorization machine

· a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically.

· For example, in a click prediction system, the factorization machine model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.

Sequence-to-Sequence algorithm

· a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio), and the output generated is another sequence of tokens. Example applications include machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output sentences in tokens).

K-nearest neighbor (K-NN) algorithm

· a simple, supervised machine learning algorithm that is mainly used for classification tasks

· can be used to solve both classification and regression problems

Reinforcement learning

Basic terms: Agent, environment, state, action, reward, and episode

Example: AWS DeepRacer

examples of real-world RL include:

· Industrial robotics

· Fraud detection

· Stock trading

· Autonomous driving

AWS Services

Kinesis Client Library (KCL)

- For developing custom consumer applications

Amazon Elastic Inference

· allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances or Amazon ECS tasks, to reduce the cost of running deep learning inference by up to 75%. Amazon Elastic Inference supports TensorFlow, Apache MXNet, PyTorch, and ONNX models.

· It allows you to attach just the right amount of GPU-powered inference acceleration to any Amazon EC2 or Amazon SageMaker instance type or Amazon ECS task. This means you can now choose the instance type that is best suited to the overall compute, memory, and storage needs of your application, and then separately configure the amount of inference acceleration that you need.

Amazon Kinesis Data Firehose

· buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations. You can configure the buffer size and buffer interval while creating your delivery stream.

· Buffer size is in MBs and ranges from 1MB to 128MB for Amazon S3 destination and 1MB to 100MB for Amazon Elasticsearch Service destination. Buffer interval is in seconds and ranges from 60 seconds to 900 seconds. Please note that in circumstances where data delivery to destination is falling behind data writing to a delivery stream, Firehose raises buffer size dynamically to catch up and make sure that all data is delivered to the destination.
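A minimal boto3 sketch of setting buffering hints on an S3 destination (the stream name, role ARN, and bucket ARN are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",          # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-destination-bucket",                   # placeholder
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
    },
)
```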

AWS Glue

You can create streaming extract, transform, and load (ETL) jobs that run continuously, consume data from streaming sources like Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). The jobs cleanse and transform the data, and then load the results into Amazon S3 data lakes or JDBC data stores.

As you process streaming data in an AWS Glue job, you have access to the full capabilities of Spark Structured Streaming to implement data transformations, such as aggregating, partitioning, and formatting, as well as joining with other datasets to enrich or cleanse the data for easier analysis. For example, you can access an external system to identify fraud in real-time or use machine learning algorithms to classify data or detect anomalies and outliers. In other words, AWS Glue can do streaming ETL before saving the results as Parquet in S3!

AWS Data Pipeline

· a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

· To train a model in SageMaker, you create a training job.

· Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3 bucket.

· Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.

· Amazon SageMaker renews the model artifact and updates the endpoint.

· The converted CSV is available for ad hoc queries with Amazon Athena.

· Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.

Amazon SageMaker

Most Amazon SageMaker algorithms work best when you use the optimized protobuf recordIO data format for training. Using this format allows you to take advantage of Pipe mode. In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3). Streaming can provide faster start times for training jobs and better throughput.

In Pipe input mode, your data is fed on-the-fly into the algorithm container without involving any disk I/O. This approach shortens the lengthy download process and dramatically reduces startup time. It also offers generally better read throughput than File input mode because your data is fetched from Amazon S3 by a highly optimized multi-threaded background process. It also allows you to train on datasets that are much larger than the 16 TB Amazon Elastic Block Store (EBS) volume size limit.
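A minimal SageMaker Python SDK sketch of requesting Pipe mode (the image URI, role, and S3 paths are placeholders):

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<algorithm-image-uri>",   # placeholder
    role="<execution-role-arn>",         # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                   # stream training data from S3 instead of downloading it
    output_path="s3://my-bucket/output/",
)

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",
    content_type="application/x-recordio-protobuf",   # optimized protobuf recordIO format
)
estimator.fit({"train": train_input})
```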

Amazon SageMaker Batch Transform

· To get inferences for an entire dataset, use Amazon SageMaker Batch Transform. With batch transform, you create a batch transform job using a trained model and the dataset, which must be stored in Amazon S3. Amazon SageMaker saves the inferences in an S3 bucket that you specify when you create the batch transform job.

· You can use Amazon SageMaker Batch Transform to exclude attributes before running predictions. You can also join the prediction results with partial or entire input data attributes when using data that is in CSV, text, or JSON format. This eliminates the need for any additional pre-processing or post-processing and accelerates the overall ML process.
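A minimal SageMaker Python SDK sketch of a batch transform job that drops an ID column before prediction and joins the results back onto the input (the model name and S3 paths are placeholders):

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-trained-model",                 # placeholder for an existing SageMaker model
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",
    accept="text/csv",
    assemble_with="Line",
)

transformer.transform(
    data="s3://my-bucket/batch-input/data.csv",
    content_type="text/csv",
    split_type="Line",
    input_filter="$[1:]",     # exclude the first column (e.g. a record ID) from inference
    join_source="Input",      # join predictions with the input records in the output
)
transformer.wait()
```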

Amazon SageMaker Hosting Services is only used for getting inferences one at a time (real-time requests), not for an entire dataset.

Amazon SageMaker Neo is simply used for ML-inference optimization.

Amazon SageMaker Inference Pipeline

· a tool used to chain different stages of your inference.

· composed of a linear sequence of two to five containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pre-trained SageMaker built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks. Inference pipelines are fully managed.

· You can add SageMaker Spark ML Serving and Scikit-learn containers that reuse the data transformers developed for training models. The entire assembled inference pipeline can be considered as a SageMaker model that you can use to make either real-time predictions or to process batch transforms directly without any external preprocessing.
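A minimal SageMaker Python SDK sketch of chaining containers with PipelineModel; sparkml_model and xgb_model stand in for Model objects you would have created earlier (for example, a SparkML Serving preprocessor and an XGBoost predictor):

```python
from sagemaker.pipeline import PipelineModel

# sparkml_model and xgb_model are assumed to be existing sagemaker.model.Model objects.
pipeline_model = PipelineModel(
    name="preprocess-then-predict",
    role="<execution-role-arn>",        # placeholder
    models=[sparkml_model, xgb_model],  # requests flow through the containers in order
)

predictor = pipeline_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```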

Amazon SageMaker Ground Truth

· a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning.

Amazon SageMaker Object2Vec algorithm

· a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space.

· You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.

Lifecycle configuration feature in Amazon SageMaker

· write a script to install a list of libraries and configure it to automatically execute every time your notebook instance is started

· you can automate these customizations to be applied at different phases of the lifecycle of an instance

· Similarly, you can choose to automatically run the script only once when the notebook instance is created.

AWS Database Migration Service (AWS DMS)

· a cloud service that makes it easy to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud, between on-premises instances (through an AWS Cloud setup), or between combinations of cloud and on-premises setups. With AWS DMS, you can perform one-time migrations, and you can replicate ongoing changes to keep sources and targets in sync.

Amazon FSx for Lustre file system

· if you need faster startup

· can be used for subsequent iterations of training jobs on Amazon SageMaker, preventing repeated downloads of common Amazon S3 objects

· natively integrated with Amazon S3

Amazon Comprehend

· a natural language processing (NLP) service that uses machine learning to find meaning and insights in text.

· There’s no need to deploy or train your own model as all of these services are fully managed and are readily-available through APIs.

· Custom entity recognition extends the capability of Amazon Comprehend by enabling you to identify new entity types not supported as one of the preset generic entity types. This means that in addition to identifying entity types such as LOCATION, DATE, PERSON, and so on, you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs. Creating a custom entity recognition model is a more effective approach, compared to using string matching or regular expressions to identify entities. For example, to extract product codes, it would be difficult to enumerate all possible patterns to apply string matching. But a custom entity recognition model can learn the context where those product codes are most likely to appear and then make such inferences even though it has never previously seen the exact product codes. As well, typos in product codes and the addition of new product codes can still be expected to be caught by Amazon Comprehend’s custom entity recognition model but would be missed when using string matches against a static list.
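A minimal boto3 sketch of the built-in entity detection API (the sample text is made up; custom entity recognition instead uses a trained recognizer endpoint):

```python
import boto3

comprehend = boto3.client("comprehend")

text = "Jeff Bezos founded Amazon in Seattle in 1994."
response = comprehend.detect_entities(Text=text, LanguageCode="en")

for entity in response["Entities"]:
    print(entity["Type"], entity["Text"], round(entity["Score"], 2))
```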

Amazon Rekognition Video

· can detect objects, scenes, faces, celebrities, text, and inappropriate content in videos. You can also search for faces appearing in a video using your own repository or collection of face images.

Amazon Transcribe

· Speech to text

· There’s no need to deploy or train your own model as all of these services are fully managed and are readily-available through APIs

Amazon Translate

· Neural Machine Translation (MT) service for translating text between supported languages.

· There’s no need to deploy or train your own model as all of these services are fully managed and are readily-available through APIs.

Amazon Lex

· chatbot AI

Amazon Polly

· Pronunciation Lexicons give you additional control over how Amazon Polly pronounces words uncommon to the selected language. For example, you can specify the pronunciation using a phonetic alphabet.

Amazon API Gateway

· a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. API Gateway can be used to present an external-facing, single point of entry for Amazon SageMaker endpoints.

· API Gateway can be used to front an Amazon SageMaker inference endpoint as (part of) a REST API, by making use of an API Gateway feature called mapping templates. This feature makes it possible for the REST API to be integrated directly with an Amazon SageMaker runtime endpoint, thereby avoiding the use of any intermediate compute resource (such as AWS Lambda or Amazon ECS containers) to invoke the endpoint. The result is a solution that is simpler, faster, and cheaper to run.
