Part 1. Softmax regression [60 points]
In this part, you will implement softmax regression (in problem1.py) with stochastic gradient descent in python3. We provide the following files: a) problem1.py – You will implement several functions of softmax regression. Do not change the input and the output of the functions. b) test1.py – This file includes unit tests. Run this file by typing 'pytest -v test1.py' in the terminal. No modification is required.
Part 2. Adding CNN and fully connected layers to recognize handwritten digits on PyTorch [40 points]
In this part, you will deal with the MNIST Database [1]. The MNIST Database is a collection of samples of handwritten digits from many people, originally collected by the National Institute of Standards and Technology (NIST), and modified to be more easily analyzed computationally. We will use a tutorial and sample software provided:
• Read and run a tutorial [2] to become familiar with how to add CNN layers in PyTorch.
• Download hw4_part2.ipynb and run it to become familiar with the code. Currently it contains two fully connected layers with softmax.
• Your job is to add one CNN layer with one pooling layer before the two fully connected layers with softmax. Refer to the detailed instructions about the CNN layer. The main difference between the tutorial and the given hw4_part2.ipynb is that the input image in hw4_part2.ipynb has only 1 channel (i.e., grayscale). Report the results of the two fully connected layers both without and with the CNN.
• Then, experiment with at least 3 alternative network topologies and hyper-parameters (e.g., different # of CNN/fully-connected layers, # of epochs, # of hidden units, learning rate, batch size, and different activation functions).
• Save and summarize the results and report them.
• Through the experiments, what is the best configuration? What prediction accuracy did you get on the test set? What did you learn?
What to turn in: Submit to Canvas your problem1.py and a pdf document for part 2. This is an individual assignment.
[1] MNIST Database: https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html
[2] Tutorial: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
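For orientation only, here is a minimal sketch of what "one CNN layer with one pooling layer before the two fully connected layers" could look like in PyTorch for 1-channel 28x28 MNIST images. The channel count, kernel size, and hidden-layer width below are illustrative assumptions, not values required by the assignment; the actual model must be added inside hw4_part2.ipynb.

# Minimal sketch (not the required solution): one conv + one max-pool layer
# in front of two fully connected layers for 1-channel 28x28 MNIST images.
# The channel count (16), kernel size (5), and hidden size (128) are
# illustrative assumptions, not values mandated by the assignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=5, padding=2)  # 1 input channel (grayscale)
        self.pool = nn.MaxPool2d(2, 2)                           # 28x28 -> 14x14
        self.fc1 = nn.Linear(16 * 14 * 14, 128)
        self.fc2 = nn.Linear(128, 10)                            # 10 digit classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)   # raw logits; use nn.CrossEntropyLoss (softmax is applied inside)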
For this assignment, you will: (70 pts) Implement linear regression with gradient descent (30 pts) Make predictions by using your implementation
Part 1: Implement linear regression with gradient descent
In this problem, you will implement the linear regression algorithm in python3. We provide the following files: a) linear_regression.py – You will implement several functions. As we discussed in class, implement the functions by using vectorization. You may refer to matrix calculus here: https://en.wikipedia.org/wiki/Matrix_calculus Do not change the input and the output of the functions. b) test.py – This file includes unit tests. Run this file by typing 'pytest -v test.py' in the terminal as you did in homework 1 in order to check whether all of the functions are properly implemented. No modification is required.
Part 2: Make predictions by using your implementation
Given training and test sets, you will make predictions for the test examples by using your linear regression implementation (linear_regression.py). We provide the following file: a) application.py – write your code in this file. Do not change X and y. Please play with the parameters alpha and number of epochs to make sure your testing loss is smaller than 1e-2 (i.e., 0.01). Report your parameters, training loss and testing loss. In addition, based on your observations, report a relationship between alpha and the number of epochs. Note that a single epoch means a single pass over all examples in the training set.
What to turn in: Submit to Canvas your linear_regression.py, application.py and a pdf document for part 2. This is an individual assignment.
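As a rough illustration of the vectorized approach mentioned above (not the required linear_regression.py interface, whose function names and signatures you must follow exactly), a gradient-descent step for mean-squared-error linear regression can be written without any Python loops over examples:

# Illustrative sketch only -- the actual function names and signatures must match
# the ones given in linear_regression.py. Assumes X has shape (n, d) and y shape (n,);
# the 1/(2n) loss convention here is an assumption, not a requirement of the handout.
import numpy as np

def compute_loss(X, y, w, b):
    r = X @ w + b - y
    return float(r @ r) / (2 * len(y))

def gradient_step(X, y, w, b, alpha):
    # One full-batch gradient descent step, fully vectorized (no loop over examples).
    n = len(y)
    r = X @ w + b - y                # residuals, shape (n,)
    grad_w = X.T @ r / n
    grad_b = r.mean()
    return w - alpha * grad_w, b - alpha * grad_b

def train(X, y, alpha=0.01, epochs=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        w, b = gradient_step(X, y, w, b, alpha)
    return w, b

Experimenting with alpha and the number of epochs in a driver like train() above is the same kind of experiment that application.py asks you to report on.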
For this assignment, you will: (0 pts) Get started with Python (60 pts) Implement Decision Tree with Discrete Attributes (40 pts) Credit Risk Prediction
Part 0: Getting Started with Python
As in all the homeworks this semester, you will be using Python. So let's get started first with our Python installation.
Python Installation and Basic Configuration
First off, you will need to install a recent version of Python 3 here. There are lots of online resources for help in installing Python. Alternatively, there is a nice collection called Anaconda, which comes with Python plus tons of helpful packages that we may use down the line in this course: Anaconda, which has install instructions for Windows, Linux, and Mac OSX.
Installing Python Libraries (optional)
You may need to install Python libraries. To manage your Python installations, we recommend pip. Pip is a tool for installing and keeping track of Python packages. It is a replacement for easy_install, which is included with Python. It's a bit smarter than easy_install, and gives better error messages, so you probably want to use it. You can install pip and the two packages we currently need by running these commands:
> easy_install pip
> pip install -r reqs.pip
Then, you may install other Python libraries such as NumPy by typing 'pip install numpy'.
Part 1: Implement Decision Tree with Discrete Attributes (60 pts)
In this assignment, you will implement the decision tree algorithm for a classification problem in python 3. We provide the following three files: a) data1.csv – You will load the file, build a tree, and evaluate its performance. The first row of the file is the header (including the names of the attributes). In the remaining rows, each row represents an instance/example. The first column of the file is the target label. b) part1.py – You will implement several functions. Do not change the input and the output of the functions. c) test1.py – This file includes unit tests. Run this file by typing 'pytest -v test1.py' in the terminal to check whether all of the functions are properly implemented. No modification is required.
Part 2: Credit Risk Prediction (40 pts)
Let's assume that you work for a credit card company. Given the sample credit dataset (credit.txt) as a training set, your job is to build a decision tree and make risk predictions for individuals. The target/class variable is credit risk, described as high or low. Features are debt, income, marital status, property ownership, and gender.
Task 2-1: Draw your decision tree and report it. You may use visualization tools (e.g., Graphviz) or use text. You might find it easier if you turn the decision tree on its side, and use indentation to show levels of the tree as it grows from the left. For example:
outlook = sunny
|  humidity = high: no
|  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|  windy = TRUE: no
|  windy = FALSE: yes
Feel free to print out something similarly readable if you think it is easier to code. Apply the decision tree to determine the credit risk of the following individuals:
Name  Debt  Income  Married?  Owns Property  Gender
Tom   low   low     no        Yes            Male
Ana   low   medium  yes       Yes            female
Report a snapshot of your decision tree, and the predicted credit risk of Tom and Ana.
Task 2-2: How does your decision tree change if Sofia's credit risk is high instead of low as recorded in the training data? Given the decision tree constructed from the original dataset, if any exist, name the features that do not play a role in the decision tree.
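For intuition about Part 1, here is a small sketch of the entropy and information-gain computations that an ID3-style decision tree for discrete attributes is typically built on. The function names are illustrative; the actual implementation must follow the signatures given in part1.py.

# Sketch of the entropy / information-gain calculations an ID3-style tree for
# discrete attributes is usually built on. Names are illustrative only; the real
# implementation must follow the signatures in part1.py.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain = H(labels) - sum over values v of P(attribute = v) * H(labels | attribute = v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# toy usage in the style of the outlook example above:
# print(information_gain(["no", "yes", "yes", "no"], ["sunny", "overcast", "rainy", "sunny"]))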
What to turn in: Submit to Canvas your part1.py and a pdf document for Part 2. This is an individual assignment.
Part 1. Ranked retrieval using PageRank (80 points)
In this assignment, you will crawl a collection of web pages, calculate their PageRank scores, and then use these PageRank scores in tandem with a boolean query operator to rank the results. Start by downloading hw3.zip and decompressing it; you should find two python files.
a) cs547.py – Just like HW1, this helper class will be used to represent a student's identifying information. Any assignments without an instantiated student object of type Student will not be graded. You do NOT need to modify this file.
b) hw3.py: This is where you will fill out three functions: index_dir(…), tokenize(…) and ranked_search(…).
index_url(self, url): This function crawls through a web directory of html files and generates an index of the contents. It should also extract hyperlink references for use in computing PageRank scores.
tokenize(self, text): This function converts a string of terms into a list of valid tokens. For the purposes of this assignment, a valid token consists of only English alphabet characters (a-z, A-Z) or numbers (0-9). All other characters are considered as token delimiters. Convert all letters to lower case. Treat the html source as your input document for purposes of this assignment.
ranked_search(self, text): This function searches for the terms in "text" in our index and returns at most the 10 highest-ranked results based on their PageRank scores. Return a list of tuples containing (url, the PageRank score) of relevant search results. Note that returned url documents should include all terms in the query/text. The returned tuples should be in descending order by the PageRank score.
Use a teleportation factor of 0.1 in your PageRank calculation (meaning that 90% of the time the random surfer follows links on a page and that 10% of the time, the random walker gets bored and randomly teleports to any page with equal probability). Feel free to create member variables and inner functions in the PageRankIndex class, but your code should satisfy the basic requirements of index_dir(…), tokenize(…) and ranked_search(…).
We have provided a web-based corpus and an index page listing all documents in this corpus here (http://web.cs.wpi.edu/~kmlee/cs547/new10/index.html). Treat this index as the root node of our webgraph. Pull out hyperlinks based on anchor (<a>) tags, not based on 'http://' strings, since URLs may exist in the document text without being anchor tags. For example, an anchor tag might be <a href="http://en.wikipedia.org/wiki/Main_Page">Wikipedia</a>; extract http://en.wikipedia.org/wiki/Main_Page. If http://en.wikipedia.org/wiki/page appears in the html without an anchor tag, ignore it; do not consider it a valid hyperlink reference. Each valid anchor tag in the html corpus should be treated as a node in the webgraph. If that node doesn't exist in our web corpus, treat it as a leaf node.
Beautiful Soup is a Python library for pulling data out of HTML and XML files. Install the Beautiful Soup library to parse the html documents. You may install and use the Numpy library in your solution in order to deal with a large matrix, as that will be available in the grading environment. It is not necessary, but may be useful and can be found at http://numpy.scipy.org/. You should be able to manually check whether your code is doing what you think it should. Once you submit your final code, we will evaluate it by constructing an index and issuing several queries (single term and multiple terms). Note that multiple terms mean more than one term and will be ANDed together for purposes of document retrieval.
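As a conceptual illustration of the PageRank computation described above (teleportation factor 0.1, i.e., 10% random jumps and 90% link-following), here is a sketch using power iteration over an adjacency list. The graph representation and names are assumptions for illustration only; they are not the required structure of the PageRankIndex class in hw3.py.

# Sketch of PageRank with teleportation factor 0.1, computed by power iteration.
# The adjacency-list representation and names here are assumptions, not the
# required design of hw3.py.
import numpy as np

def pagerank(links, teleport=0.1, iters=100):
    """links: dict mapping each url to the list of urls it links to."""
    urls = sorted(set(links) | {v for outs in links.values() for v in outs})
    idx = {u: i for i, u in enumerate(urls)}
    n = len(urls)
    M = np.zeros((n, n))             # column-stochastic link matrix
    for u in urls:
        outs = links.get(u, [])
        if outs:
            for v in outs:
                M[idx[v], idx[u]] += 1.0 / len(outs)
        else:                        # leaf / dangling node: spread its mass uniformly
            M[:, idx[u]] = 1.0 / n
    r = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(iters):
        r = teleport / n + (1 - teleport) * (M @ r)
    return dict(zip(urls, r))

# toy usage: page "a" links to "b" and "c", "b" links to "c", "c" links back to "a"
# print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))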
Part 2. Traditional Search Engines vs. Generative AI-based Search Engines (20 points)
In this assignment, you will compare traditional search engines (e.g., Google, Bing) with generative AI-based search engines (e.g., perplexity.ai and you.com). Specifically, you'll need to try at least three different search queries, report which queries you used, and assess the pros and cons of both types of search engines. Finally, provide a conclusion on when it is most appropriate to use traditional search engines vs. generative AI-based search engines.
What to turn in: Rename hw3.py to hw3_firstname_lastname.py (e.g., hw3_elon_musk.py) and submit to Canvas your hw3.py file and a pdf file for part 2. This is an individual assignment, but you may discuss general strategies and approaches with other members of the class (refer to the syllabus for details of the homework collaboration policy). At the top of hw3.py you will see a list of COLLABORATORS. Please fill this out with the names of classmates you consulted and the nature of your discussion.
QUESTION 1 [55 Marks]
Climate change stands as one of the most urgent challenges confronting our planet today. To effectively comprehend and address this critical issue, access to precise and comprehensive data regarding global temperatures and other climate-related factors is indispensable. In this regard, you serve as a data analyst at the National Aeronautics & Space Administration (NASA) and are engaged in researching Earth's climate and temperature. Your work involves utilizing datasets sourced from satellites and ground-based sensors. You have been entrusted with a dataset "earth_surface_temperatures.csv" encompassing surface temperature data for various countries worldwide, spanning from December 1743 to December 2020. Your mission is to conduct an in-depth Exploratory Data Analysis (EDA) on this dataset. This analysis aims to extract insights and answer crucial questions about the data by delving into trends and patterns. To achieve this, you are expected to:
a. Identify and rectify any missing values in the data using appropriate techniques. [5 Marks]
b. Transform the Years and Month columns into a single column labeled "Date" in the MM-YYYY format, with a datetime64[ns] data type. For example, the year 1848 and month 5 should be unified as a single value, such as 5-1848. [5 Marks]
c. Detect and investigate extreme temperature values that might be regarded as outliers. [5 Marks]
d. Compute summary statistics for temperature, monthly variation, and anomaly values, including mean, median, standard deviation, and range. [5 Marks]
e. Identify the countries included in the dataset and calculate their average temperature values. [5 Marks]
f. Determine the overall trend in global temperatures over the years and visualize this trend using a suitable chart. [5 Marks]
g. Identify the months with the highest and lowest temperatures for each country and find out whether there are noticeable seasonal patterns in the temperature data. [5 Marks]
h. Explore the variation in temperature anomalies on a monthly basis and identify any months with consistently high or low anomalies across the years. [5 Marks]
i. Choose five countries and compare the trends in their temperatures over the years, seeking any similar temperature patterns. [5 Marks]
j. Explore the potential correlation between temperature and monthly variation or anomaly values. Calculate correlation coefficients and create scatterplots to investigate this relationship. [5 Marks]
k. Provide an intriguing insight from the dataset by utilizing data visualization techniques such as histograms, box plots, or heatmaps to represent the data's distribution, trends, and relationships. [5 Marks]
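As a sketch of part (b) only: assuming the CSV's columns are literally named "Years" and "Month" (check the real headers in earth_surface_temperatures.csv), the two columns can be assembled into a datetime64[ns] "Date" column in pandas as follows.

# Sketch for part (b). The column names "Years" and "Month" are assumptions about
# the file's headers and should be verified against the actual dataset.
import pandas as pd

df = pd.read_csv("earth_surface_temperatures.csv")

# Assemble a proper datetime64[ns] column from the year and month columns.
parts = df[["Years", "Month"]].rename(columns={"Years": "year", "Month": "month"}).assign(day=1)
df["Date"] = pd.to_datetime(parts)            # dtype: datetime64[ns]

# If an MM-YYYY display string is also wanted, keep it as a separate column:
df["Date_label"] = df["Date"].dt.strftime("%m-%Y")
print(df[["Date", "Date_label"]].head())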
QUESTION 2 [45 Marks]
As a member of the retail analytics team, you have been contacted by the Category Manager at a retail store, who wants to gain a deeper understanding of the customers who buy chips and their purchasing habits within the region; these insights will eventually be used to inform the store's strategic plan for the chip category in the upcoming six months. You have received the following e-mail from your manager.
Greetings! I am following up on our earlier conversation with a few pointers to help you succeed in this task. Here are the key areas you will be working on and what we're looking for in each one: Firstly, examine the transaction data ("transaction_data" file) and look for inconsistencies, missing data, and outliers, correctly identify category items, and check the numeric data across all tables. If you notice any anomalies, please make the necessary changes in the dataset and save it for further analysis. Having clean data will make it easier for us to conduct an effective analysis. Secondly, examine the customer data ("purchase_behaviour" file) for similar issues and check for null values. Once you're satisfied with the data, merge the transaction and customer data together for analysis, ensuring that you save your files along the way. Thirdly, conduct data analysis and identify customer segments. Define the metrics, such as total sales, drivers of sales, and the source of the highest sales. Explore the data, create charts and graphs, and note any interesting trends and insights you find. Finally, deep dive into customer segments and recommend which segments we should target. Determine whether pack sizes are relevant and form an overall conclusion based on your analysis.
Here is the task: Your task is to provide a data-driven strategic recommendation for the upcoming category review. To achieve this, you must first analyze the current purchasing trends and behaviors to understand the customer segments and their chip purchasing behavior. To describe the customers' purchasing behavior, you need to identify relevant metrics. The client has a specific interest in understanding the chip purchasing behavior of different customer segments. To begin the task, download the comma-separated values (CSV) data files provided to you and conduct preliminary data checks, including:
• Generating and interpreting high-level data summaries.
• Identifying any outliers and, if necessary, removing them (if applicable).
• Verifying the data formats and correcting them, if needed (if applicable).
In addition to the preliminary data checks, it is essential to extract additional features, such as pack size and brand name, from the data. Defining relevant metrics of interest is also crucial to gaining insights into the chip purchasing behavior of different customer segments. Your ultimate goal is to formulate a strategic recommendation for the Category Manager, based on your findings. Therefore, it is essential that your insights have a commercial application and can be used to inform decision-making. Lastly, a detailed report on your analysis findings, no longer than 3-4 pages, is required. The report should include any relevant visualizations you have created, as well as your recommendation to the Category Manager, to inform the store's strategic plan for the chip category. Do not include any technical aspects of your analysis, such as coding, in the report.
Note: This is an open-ended case study and can be approached in various ways, allowing for flexibility and creativity in the analysis process.
Additional Pointers (column description of purchase behavior):
LIFESTAGE: Customer attribute that determines if they have a family or not, and at what stage of life they are in. For instance, it considers whether their children are in preschool, primary or secondary school.
PREMIUM_CUSTOMER: Customer segmentation approach that distinguishes shoppers based on the price point and product types they purchase. Its purpose is to determine whether customers are willing to pay more for brand or quality or prefer to purchase the most economical options.
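As a sketch of the feature-extraction and merge steps described in the e-mail, assuming hypothetical column names such as PROD_NAME and LYLTY_CARD_NBR (these must be checked against the actual transaction_data and purchase_behaviour files):

# Sketch only: file names, extensions, and column names below are assumptions
# about the provided CSVs and should be verified against the real data.
import pandas as pd

tx = pd.read_csv("transaction_data.csv")
cust = pd.read_csv("purchase_behaviour.csv")

# Pack size: assume the product name ends with a weight in grams, e.g. "... 175g"
tx["PACK_SIZE"] = tx["PROD_NAME"].str.extract(r"(\d+)\s*[gG]", expand=False).astype(float)

# Brand: assume the first word of the product name is the brand
tx["BRAND"] = tx["PROD_NAME"].str.split().str[0].str.upper()

# Merge transactions with customer segments (LIFESTAGE, PREMIUM_CUSTOMER)
merged = tx.merge(cust, on="LYLTY_CARD_NBR", how="left")
merged.to_csv("merged_data.csv", index=False)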
Programming [75 points]: Handling wildcard queries through a binary tree based permuterm index
In this assignment, you are provided with a binary tree based permuterm index. This index includes permuterm variations of all terms within the documents. Your program should handle not only a regular query (e.g., 'hello'), but also a wildcard query (e.g., 'hel*o'), including multiple terms (e.g., 'hello world' and 'hel*o world'). Begin by downloading hw2.zip and decompressing it; you should find three python files and five sample document files.
a) cs547.py – Just like HW1, this helper class will be used to represent a student's identifying information. Any assignments without an instantiated student object of type Student will not be graded. You do NOT need to modify this file.
b) binarytree.py: An implementation of a binary tree. This class is called in hw2.py to build a binary tree index and to search terms. The existing code populates this tree with permuterm variations, all linking to lists of documents containing those entries. You do NOT need to modify this file.
c) hw2.py: This is where you will fill out two functions: wildcard_search_or(…) and wildcard_search_and(…). You should notice that five functions are already filled out: crawl_tree(…), permute(…), rotate(…), index_dir(…), tokenize(…).
What you already have:
crawl_tree(node, term): This function crawls a binary tree looking for permuterm matches. In the function, whether term is a wildcard query or a regular query is checked. This function may call itself recursively.
permute(self, term): This function generates and returns a list of permuterm variations for the given term. For example, if term is 'hello', the list of its permuterm variations consists of 'hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello'.
rotate(self, term): This function rotates a wildcard term to generate a search token, if the term includes a single '*'. Otherwise, it returns the input term. For example, if term is 'hel*o', this function will return 'o$hel*'. If term is 'hello', it will return 'hello$'.
index_dir(base_path): This function crawls through a directory of text files and generates a permuterm index of the contents using a binary tree. tokenize() and permute() will be called.
tokenize(self, text, is_search=False): This function converts a string of terms into a list of valid tokens. If is_search is False, a valid token will consist of only English alphabet characters (a-z, A-Z). Otherwise, a valid token will consist of only English alphabet characters (a-z, A-Z) and the '*' character. Also, this function converts all letters to lower case.
What you need to fill out:
wildcard_search_or(self, text), wildcard_search_and(self, text): These two functions search for the terms in "text" in our index. Text can be a single term or multiple terms with/without a single '*'. Return a list of filenames with appropriate results. tokenize(), crawl_tree() and _rotate() will be called.
We have provided a sample set of five documents. Feel free to test your code over these five documents. You should be able to manually check whether your code is doing what you think it should. Once you submit your final code, we will evaluate it by constructing an index over our own set of 1,000 documents in a single folder and issuing several queries (single term and multiple terms with/without a single '*' symbol). You do NOT need to consider a wildcard query that includes multiple '*' symbols in this assignment (e.g., he*l*).
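For intuition, here is a stand-alone re-creation of the permute/rotate behavior described above; hw2.py already provides working implementations, so this sketch is only meant to make the stated examples concrete.

# Illustration of the permuterm behavior described above; hw2.py already provides
# working permute()/rotate() implementations, so this sketch is only for intuition.
def permute(term):
    """Return all rotations of term + '$', e.g. 'hello' -> ['hello$', 'ello$h', ...]."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate(term):
    """Rotate a single-'*' wildcard so the '*' ends up last, e.g. 'hel*o' -> 'o$hel*'."""
    if "*" not in term:
        return term + "$"
    t = term + "$"
    star = t.index("*")
    return t[star + 1:] + t[:star + 1]

print(permute("hello"))   # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(rotate("hel*o"))    # 'o$hel*'
print(rotate("hello"))    # 'hello$'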
Short-Answer Questions [25 points]:
1. Investigate two search engines (e.g., Google and Bing), and figure out what tricks they are using by explaining your testing method. (15%)
(a) Do both search engines use stemming?
(b) Do both search engines filter stop words?
2. For a particular search query, your IR system returns 14 relevant documents and 16 irrelevant documents. There are a total of 80 relevant documents in the collection. (10%)
(a) What is the precision of the system on this search?
(b) What is the recall of the system on this search?
What to turn in: Rename hw2.py to hw2_firstname_lastname.py (e.g., hw2_steve_jobs.py) and submit to Canvas your hw2.py file and a document file consisting of answers to the questions. This is an individual assignment, but you may discuss general strategies and approaches with other members of the class (refer to the syllabus for details of the homework collaboration policy). At the top of hw2.py you will see a list of COLLABORATORS. Please fill this out with the names of classmates you consulted and the nature of your discussion.
For this assignment, you will: (0 pts) Get started with Python (100 pts) Implement boolean retrieval with Python
Part 0: Getting Started with Python
As in all the homeworks this semester, you will be using Python. So let's get started first with our Python installation.
Python Installation and Basic Configuration
First off, you will need to install a recent version of Python 3 here. There are lots of online resources for help in installing Python. Alternatively, there is a nice collection called Anaconda, which comes with Python plus tons of helpful packages that we may use down the line in this course: Anaconda, which has install instructions for Windows, Linux, and Mac OS.
Part 1: Boolean retrieval with Python
For this assignment, we will build a basic boolean search engine, including the tokenization, stemming, indexing, and basic boolean query components. Begin by downloading hw1.zip and decompressing it; you should find three python files and five sample document files.
a) cs547.py – This helper class will be used to represent a student's identifying information in all programming assignments. In this assignment, you don't need to modify it. But you must instantiate a student object in hw1.py. Any assignments without an instantiated student object of type Student will not be graded.
b) PorterStemmer.py – Implementation of the Porter Stemmer. Just use it for stemming. Do NOT modify it.
c) hw1.py – This is where you will fill out four functions: index_dir(…), tokenize(…), stemming(…) and boolean_search(…). Your overall goal is to build a basic index that can support simple boolean queries (AND, OR queries only).
index_dir(self, base_path): This function should crawl through a nested directory of text files and generate an inverted index of the contents of these files. We give a simple example in the comments of hw1.py to illustrate what the resulting index should look like. This function should call tokenize and stemming as helper functions.
tokenize(self, text): This helper function should take a raw string (either extracted from each file or input by the user as a query) and convert it to a list of valid tokens. For the purposes of this assignment, a valid token consists of only English alphabet characters (a-z, A-Z) or numbers (0-9). All other characters are considered as token delimiters. For example, the text 'http://www.cnn33.com' should result in the tokens: 'http', 'www', 'cnn33' and 'com'. The text 'This 3rd class is cool, right?' should result in the tokens: 'this', '3rd', 'class', 'is', 'cool' and 'right'. Note that you should convert all letters to lower case.
stemming(self, tokens): This helper function should take a list of tokens returned from the tokenize function. Convert the list of tokens to a list of stemmed tokens. For the stemming task, you should use the Porter Stemmer, importing PorterStemmer.py.
boolean_search(self, text): This function searches for the terms in the string text in the inverted index. This function should call tokenize and stemming as helper functions to tokenize the string text and do stemming, respectively. In order to keep the search function simple, we assume "text" contains either a single term or three terms. If "text" contains only a single term, search for the term in the inverted index and return the document names containing the term. If "text" contains three terms including "or" or "and", do an OR or AND search depending on the second term ("or" or "and") in the "text".
For example, if the boolean_search is called with ‘mike or sherman’, then the return value should be a list of documents that include ‘mike’ OR ‘sherman’. If the boolean_search is called with ‘mike and sherman’, then the return value should be a list of documents that include ‘mike’ AND ‘sherman’. We have provided a sample set of five documents. Feel free to test your code over these five documents. You should be able to manually check whether your code is doing what you think it should. Once you submit your final code, we will evaluate it by constructing an index over our own set of 1,000 documents and issuing several queries (single term and three terms containing either “or” or “and”). What to turn in: Submit to Canvas your hw1.py file. This is an individual assignment, but you may discuss general strategies and approaches with other members of the class (refer to the syllabus for details of the homework collaboration policy). At the top of hw1.py you will see a list of COLLABORATORS. Please fill this out with the names of classmates you consulted and the nature of your discussion.
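For intuition about the tokenization rule described in Part 1 (alphanumeric runs only, lowercased), here is a stand-alone sketch that reproduces the two examples given above. The real hw1.py must keep the given method signatures and use PorterStemmer.py for stemming.

# Sketch of the tokenization rule described above (alphanumeric runs, lowercased).
# The actual hw1.py methods must keep their given signatures and use PorterStemmer.py
# for stemming; this standalone function only illustrates the rule.
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("http://www.cnn33.com"))            # ['http', 'www', 'cnn33', 'com']
print(tokenize("This 3rd class is cool, right?"))  # ['this', '3rd', 'class', 'is', 'cool', 'right']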
1. Visualizing SGD Trajectories of Fully-Connected Neural Networks (FCNNs) [25 pts]: First, read all the "Introduction to PyTorch" tutorials (see https://pytorch.org/tutorials/beginner/basics/intro.html), including "Training a classifier", "Learn the Basics", "Quickstart", "Tensors", "Datasets & Dataloaders", "Transforms", "Build Model", "Autograd", "Optimization", and "Save & Load Model". Then complete the following tasks:
(a) [10 pts]: Use PyTorch to build and train a simple FCNN with at least 2 hidden layers to classify the Fashion MNIST dataset (similar to Homework 3). You can use label-preserving transformations such as rotation (make sure not to rotate the images by more than, say, ±10°) to improve generalization. Your network must be fully-connected – it cannot use convolution. Report the test accuracy you get in the PDF.
(b) [10 pts]: For any fixed FCNN architecture with at least 2 hidden layers, visualize the gradient descent trajectory in 3-D from two different random parameter initializations. In your plot, you should use the first two axes to represent different values for the NN's parameters (which we will denote here collectively simply as p) and the third (vertical) axis to represent the cross-entropy f_CE(p). Of course, there are far (!) more than just 2 parameters in the NN, and thus it will be necessary to perform dimensionality reduction. You should use principal component analysis (PCA) to reduce the parameter space down to just 2 dimensions. The two PCs will represent different "directions" along which the NN parameters can vary, where these directions are chosen to minimize the reconstruction error of the data. In general, it will not be the case that a PC corresponds to just a single weight/bias; rather, moving along each PC axis will correspond to changing all of the NN's parameters at once. Concrete steps:
i. Run SGD at least two times to collect multiple trajectories of p. (Ask ChatGPT for help on how to extract all the parameters from the entire NN as a single vector.) For each value, save the corresponding training cost f_CE(p) – you will need these later. To keep things tractable, train the network on just 1000 examples of Fashion MNIST.
ii. Use the collected p vectors to estimate the first 2 principal components that map from the full parameter space down to just 2 dimensions (see sklearn.decomposition.PCA).
iii. For each p that was encountered during training, project it into the 2-d space, and then plot it as part of a 3-d scatter plot with its associated f_CE(p) value that was computed during SGD.
iv. To give a sense of the "landscape" through which SGD is "hiking", compute a dense grid of at least 25×25 points in the 2-d PC space. For each point p̃ in this 2-d space, project it back into the full parameter space (see PCA.inverse_transform) to obtain a value for p. Load these parameters p into the NN (ask ChatGPT for help on how to do this). Finally, make a surface plot (use the plot_surface function in matplotlib) showing the corresponding cost values over all points in this grid. Render this surface plot in the same figure as the 3-d scatter plot. Make sure to compute the f_CE values (for both the 3-d scatter plot and the surface plot) on the set of 1000 training data, since that is what SGD directly optimizes. Include your figure in the PDF you submit. An example figure is shown below (figure omitted here). As you can see, the two SGD trajectories started on different sides of a ridge and ended up descending into different valleys.
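A minimal sketch of two pieces that the concrete steps above rely on: flattening all of a PyTorch model's parameters into a single vector, loading such a vector back into the model, and fitting a 2-component PCA on the collected vectors. The model object and variable names here are assumptions, not a required implementation.

# Sketch of the parameter flattening / reloading and the 2-component PCA used in the
# steps above. Names like "model" and "trajectory" are assumptions for illustration.
import torch
import numpy as np
from sklearn.decomposition import PCA

def params_to_vector(model):
    """Concatenate every parameter tensor of the model into one flat numpy vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()]).numpy()

def vector_to_params(model, vec):
    """Load a flat parameter vector back into the model (used to evaluate grid points)."""
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(torch.from_numpy(vec[offset:offset + n]).reshape(p.shape).float())
            offset += n

# trajectory = list of parameter vectors saved during SGD (one per recorded step)
# pca = PCA(n_components=2).fit(np.stack(trajectory))
# low_dim = pca.transform(np.stack(trajectory))         # 2-d points for the 3-d scatter plot
# full_dim = pca.inverse_transform(grid_point_2d)       # back to the full parameter space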
(c) [3 pts]: In the figure you created in part (b), the f_CE values of the points in the 3-d scatter plot (computed during SGD) do not always exactly equal the corresponding values from the surface plot generated on the dense grid of points. Why is that, and why would it be impractical to create a surface plot over the grid that exactly matches the "real" f_CE values obtained using SGD? (Think about how PCA works and how it is used here.) Answer in a few sentences in the PDF.
(d) [2 pts]: Assume that the set of all the p vectors you collected has zero mean (i.e., the sum of all the p vectors equals the zero vector). Let p, p′ represent two different configurations of the NN's parameters, and let p̃, p̃′ ∈ R^2 represent their respective projections in the 2-d PC space. Furthermore, let p̂, p̂′ represent the reconstructions (using PCA.inverse_transform) of the NN's parameters from p̃, p̃′. Which of the following statements are always true? Report the correct statements in your PDF.
i. If p = 2p′ (i.e., the NN parameters in the first configuration are all twice the magnitude of the corresponding parameters in the second configuration), then p̃ = 2p̃′.
ii. If p̃ = 2p̃′, then p = 2p′.
iii. If p̃ = 2p̃′, then p̂ = 2p̂′.
iv. f_CE(p̂) ≤ f_CE(p).
2. Simple CNN for Fashion MNIST [15 pts]: Read the following PyTorch tutorial: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. Then, apply the same methodology to the Fashion MNIST dataset (see the torchvision.datasets.FashionMNIST class). Note that, since Fashion MNIST images are grayscale, they have just a single color channel rather than 3. Hence, you will need to adapt the CNN slightly so that the number of input channels in the first convolutional layer is 1 instead of 3. Also, the normalization step transforms.Normalize will use just 1 channel instead of 3. Finally, as the image size is different compared to CIFAR10, the size of the feature maps will also be different. You will thus need to update the number of incoming neurons to the first fully-connected (nn.Linear) layer accordingly. After making these modifications, train and evaluate a classifier on this dataset (a sketch of these adaptations is shown after this assignment). Report the test accuracy in the PDF you submit.
3. Supervised Pre-Training and Fine-Tuning of CNNs [15 pts]: Read the following PyTorch tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html. Then, apply the same methodology to the Fashion MNIST dataset. Report the test accuracies in the PDF you submit, using either (a) fine-tuning of the whole model or (b) training just the final (classification) layer.
4. CNNs for Behavioral Cloning in Pong [20 pts]: Train an AI agent to play Pong (see https://www.gymlibrary.dev/environments/atari/pong/). In this game, each player can execute one of 6 possible actions at each timestep (NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, and LEFTFIRE). The goal is to execute the best action at each timestep based on the current state of the game. To get started, first download the following files:
• https://s3.amazonaws.com/jrwprojects/pong_actions.pt
• https://s3.amazonaws.com/jrwprojects/pong_observations.pt
Together, these files define (image, action) pairs generated by an expert player from the Atari Pong video game. Using PyTorch, implement and train any NN architecture you choose (I recommend a simple CNN) to map the images to their corresponding expert actions.
Your NN will implement the control policy that dictates how the agent behaves in different situations, and the approach of training this NN in a supervised manner from expert trajectories is called behavioral cloning. After training your NN, save it to a file (use torch.save). Then, load the model in play_atari.py and see how well your AI "player" does against the computer. To receive full credit, your trained agent should be able to beat the computer (i.e., reach 21 points first). (Note: for this part of the homework, you need to use a standard Python interpreter – Google Colab cannot render the frames of the video game.)
In addition to your Python code (homework4_WPIUSERNAME1.py or homework4_WPIUSERNAME1_WPIUSERNAME2.py for teams), create a PDF file (homework4_WPIUSERNAME1.pdf or homework4_WPIUSERNAME1_WPIUSERNAME2.pdf for teams) containing the screenshots described above.
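Referring back to Problem 2 (Simple CNN for Fashion MNIST), here is a sketch of the adaptations it describes: 1 input channel, 1-channel normalization, and a recomputed input size for the first nn.Linear layer. With 28×28 inputs and the tutorial's two 5×5 convolutions and two 2×2 poolings, the feature maps come out 16 × 4 × 4; verify this against your own architecture.

# Sketch of the Problem 2 adaptations (assumes the CIFAR10 tutorial's conv/pool stack).
# The 16 * 4 * 4 fc1 input size follows from 28x28 inputs and should be double-checked.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),   # a single channel instead of 3
])

class FashionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)        # was Conv2d(3, 6, 5) for CIFAR10
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)  # 28x28 input -> 4x4 feature maps
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)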
1. Python and Numpy Warm-up Exercises [20 pts]: This part of the homework is intended to help you practice your linear algebra and how to implement linear algebraic and statistical operations in Python using numpy (to which we refer in the code below as np). For each of the problems below, write a method (e.g., problem 1a) that returns the answer for the corresponding problem. In all problems, you may assume that the dimensions of the matrices and/or vectors that are given as input are compatible for the requested mathematical operations. You do not need to perform error-checking.
Note 1: In mathematical notation we usually start indices with j = 1. However, in numpy (and many other programming settings), it is more natural to use 0-based array indexing. When answering the questions below, do not worry about "translating" from 1-based to 0-based indexes. For example, if the (i, j)th element of some matrix is requested, you can simply write A[i,j].
Note 2: To represent and manipulate vectors and matrices, please use numpy's array class (not the matrix class).
Note 3: While the difference between a row vector and a column vector is important when doing math, numpy does not care about this difference as long as the array is 1-D. This means, for example, that if you want to compute the inner product between two vectors x and y, you can just write x.dot(y) without needing to transpose the x. If x and y are 2-D arrays, however, then it does matter whether they are row-vectors or column-vectors, and hence you might need to transpose accordingly.
(a) Given two matrices A and B, compute and return an expression for A + B. [0 pts] Answer: While it is completely valid to use np.add(A, B), this is unnecessarily verbose; you really should make use of the "syntactic sugar" provided by Python's/numpy's operator overloading and just write: A + B. Similarly, you should use the more compact (and arguably more elegant) notation for the rest of the questions as well.
(b) Given matrices A, B, and C, compute and return AB − C (i.e., right-multiply matrix A by matrix B, and then subtract C). Use dot or np.dot. [1 pt]
(c) Given matrices A, B, and C, return A ⊙ B + C⊤, where ⊙ represents the element-wise (Hadamard) product and ⊤ represents matrix transpose. In numpy, the element-wise product is obtained simply with *. [1 pt]
(d) Given column vectors x and y, compute the inner product of x and y (i.e., x⊤y). [1 pt]
(e) Given matrix A and integer i, return the sum of all the entries in the ith row whose column index is even, i.e., ∑_{j : j is even} A_ij. Do not use a loop, which in Python can be very slow. Instead use the np.sum function. [2 pts]
(f) Given matrix A and scalars c, d, compute the arithmetic mean over all entries of A that are between c and d (inclusive). In other words, if S = {(i, j) : c ≤ A_ij ≤ d}, then compute (1/|S|) ∑_{(i,j)∈S} A_ij. Use np.nonzero along with np.mean. [2 pts]
(g) Given an (n×n) matrix A and integer k, return an (n×k) matrix containing the right-eigenvectors of A corresponding to the k eigenvalues of A with the largest absolute value. Use np.linalg.eig. [2 pts]
(h) Given a column vector (with n components) x, an integer k, and positive scalars m, s, return an (n × k) matrix, each of whose columns is a sample from the multidimensional Gaussian distribution N(x + mz, sI), where z is a column vector (with n components) containing all ones and I is the identity matrix. Use either np.random.multivariate_normal or np.random.randn. [2 pts]
(i) Given a matrix A with n rows, return a matrix that results from randomly permuting the columns (but not the rows) in A. [2 pts]
(j) Z-scoring: Given a vector x, return a vector y such that each y_i = (x_i − x̄)/σ, where x̄ is the mean (use np.mean) of the elements of x and σ is the standard deviation (use np.std). [2 pts]
(k) Given an n-vector x and a non-negative integer k, return an n × k matrix consisting of k copies of x. You can use numpy methods such as np.newaxis, np.atleast_2d, and/or np.repeat. [2 pts]
(l) Given a k × n matrix X = [x^(1) … x^(n)] and a k × m matrix Y = [y^(1) … y^(m)], compute an n × m matrix D = [d_ij] consisting of all pairwise L2 distances d_ij = ∥x^(i) − y^(j)∥_2. In this problem you may not use loops. Instead, you can avail yourself of numpy objects & methods such as np.newaxis, np.atleast_3d, np.repeat, np.swapaxes, etc. (There are various ways of solving it.) Hint: from X (resp. Y), construct a 3-d matrix that contains multiple copies of each of the vectors in X (resp. Y); then subtract these 3-d matrices. [3 pts]
2. Training 2-Layer Linear Neural Networks with Stochastic Gradient Descent [25 pts]:
(a) Train an age regressor that analyzes a (48 × 48 = 2304)-pixel grayscale face image and outputs a real number ŷ that estimates how old the person is (in years). The training and testing data are available here:
• https://s3.amazonaws.com/jrwprojects/age_regression_Xtr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_ytr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_Xte.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_yte.npy
Your prediction model g should be a 2-layer linear neural network that computes ŷ = g(x; w) = x⊤w + b, where w is the vector of weights and b is the bias term. The cost function you should optimize is
f_MSE(w, b) = (1/(2n)) ∑_{i=1}^{n} (ŷ^(i) − y^(i))^2
where n is the number of examples in the training set D_tr = {(x^(1), y^(1)), …, (x^(n), y^(n))}, each x^(i) ∈ R^2304 and each y^(i) ∈ R. To optimize the weights, you should implement stochastic gradient descent (SGD). Note: you must complete this problem using only linear algebraic operations in numpy – you may not use any off-the-shelf linear regression or neural network software, as that would defeat the purpose. There are several different hyperparameters that you will need to optimize:
• Mini-batch size ñ.
• Learning rate ϵ.
• Number of epochs.
In order not to cheat (in the machine learning sense) – and thus overestimate the performance of the network – it is crucial to optimize the hyperparameters only on a validation set. (The training set would also be acceptable but typically leads to worse performance.) To create a validation set, simply set aside a fraction (e.g., 20%) of age_regression_Xtr.npy and age_regression_ytr.npy to be the validation set; the remainder (80%) of these data files will constitute the "actual" training data. While there are fancier strategies (e.g., Bayesian optimization) that can be used for hyperparameter optimization, it's often effective to just use a grid search over a few values for each hyperparameter. In this problem, you are required to explore systematically (e.g., using nested for loops) at least 2 different values for each hyperparameter. Performance evaluation: Once you have tuned the hyperparameters and optimized the weights and bias term so as to minimize the cost on the validation set, then: (1) stop training the network and (2) evaluate the network on the test set.
Report both the training and the test f_MSE in the PDF document you submit, as well as the training cost values for the last 10 iterations of gradient descent (just to show that you actually executed your code).
3. Gradient descent: what can go wrong? [30 pts] Please enter your code in a file named homework1_problem3.py, with one Python function (e.g., problem3a) for each subproblem.
(a) [10 pts]: The graph below plots the function f(x) that is defined piece-wise as:
f(x) = −x^3                              for x < −0.1
f(x) = −3x/100 − 1/500                   for −0.1 ≤ x < 3
f(x) = −(x − 31/10)^3 − 23/250           for 3 ≤ x < 5
f(x) = (1083/200)(x − 6)^2 − 6183/500    for x ≥ 5
[Plot of f(x) omitted.]
As you can see, the function has a long nearly flat section (sometimes known as a plateau) just before the minimum. (Footnote: the ugly constants in f were chosen to give rise to these characteristics while ensuring that the function remains differentiable.) Plateaux can cause big problems during optimization. To show this:
i. Derive by hand the (piece-wise) function ∇f and implement it in Python/numpy.
ii. Use your implementation of ∇f to conduct gradient descent for T = 100 iterations. Always start from an initial x = −3. Try using various learning rates: 1e-3, 1e-2, 1e-1, 1e0, 1e1. Plot f, ∇f, as well as superimposed dots that show the sequence ((x^(1), y^(1)), …, (x^(T), y^(T))) of gradient descent. Use plt.legend to indicate which scatter plot corresponds to which learning rate.
iii. Describe in 1-2 sentences what you observe during gradient descent for the set of learning rates listed above.
iv. Find a learning rate ϵ for which gradient descent successfully converges to min f(x), and report ϵ in the PDF file.
(b) [8 pts]: Even a convex paraboloid – i.e., a parabola in multiple dimensions that has only one local minimum and no plateaux – can cause problems for "vanilla" SGD (i.e., the kind we've learned in class so far). Examine the scatter-plot below, which shows the sequence ((x^(1), y^(1)), …, (x^(T), y^(T))) of gradient descent on a convex paraboloid f, starting at x^(1) = [1, −3]⊤, where each x ∈ R^2. The descent produces a zig-zag pattern that takes a long time to converge to the local minimum.
[Scatter plot with axes x1 and x2 omitted.]
i. Speculate how the SGD trajectory would look if the learning rate were made to be very small (e.g., 100x smaller than in the figure above).
ii. Let f(x1, x2) = a1(x1 − c1)^2 + a2(x2 − c2)^2. Pick values for a1, a2, c1, c2 so that – when the scatter-plot shown above is superimposed onto it – the gradient descent is realistic for f. Rather than just guess randomly, consider: why would the zig-zag be stronger in one dimension than another, and how would this be reflected in the function's constants? Plot a contour graph using plt.contour and superimpose the scatter-plot using plt.scatter. You can find the gradient descent sequence in gradient_descent_sequence.txt. Note that you are not required to find the exactly correct constants or even to estimate them algorithmically. Rather, you can combine mathematical intuition with some trial-and-error. Your solution should just look visually "pretty close" to get full credit. Note: to ensure proper rendering, use plt.axis('equal') right before calling plt.show().
(c) [6 pts]: This problem is inspired by this paper. Consider the function f(x) = (2/3)|x|^(3/2). Derive ∇f, implement gradient descent and plot the descent trajectories ((x^(1), y^(1)), …, (x^(T), y^(T))) for a variety of learning rates 1e-3, 1e-2, 1e-1 and a variety of starting points. See what trend emerges, and report it in the PDF.
(d) [6 pts]: While very (!) unlikely, it is theoretically possible for gradient descent to converge to a local maximum. Give the formula of a function (in my own solution, I used a degree-4 polynomial), a starting point, and a learning rate such that gradient descent converges to a local maximum after 1 descent iteration (i.e., after 1 iteration, it reaches the local maximum, and the gradient is exactly 0). Prove (by deriving the exact values for the descent trajectory) that this is true. You do not need to implement this in code (and, in fact, due to finite-precision floating-point arithmetic, it might not actually converge as intended).
Submission: Create a Zip file containing both your Python and PDF files, and then submit on Canvas. If you are working as part of a group, then only one member of your group should submit (but make sure you have already signed up in a pre-allocated team for the homework on Canvas).

1. A linear NN will never solve the XOR problem [10 points, on paper]: Read the description of the XOR problem from Section 6.1 of the Deep Learning textbook: https://www.deeplearningbook.org/contents/mlp.html. Assume that you wish to classify the 4 points in this dataset using a 2-layer linear NN (i.e., the same model as in Homework 1, Problem 2). Then show (by deriving the gradient, setting it to 0, and solving mathematically, not in Python) that the values for w = [w1, w2]⊤ and b that minimize the function f_MSE(w, b) in Equation 6.1 are: w1 = 0, w2 = 0, and b = 0.5 – in other words, the best prediction line is simply flat and always guesses ŷ = 0.5.
2. Derivation of softmax regression gradient updates [20 points, on paper]: As explained in class, let W = [w^(1) … w^(c)] be an m × c matrix containing the weight vectors from the c different classes. The output of the softmax regression neural network is a vector with c dimensions such that:
ŷ_k = exp(z_k) / ∑_{k'=1}^{c} exp(z_{k'})    (1)
z_k = x⊤w^(k) + b_k
for each k = 1, …, c. Correspondingly, our cost function will sum over all c classes:
f_CE(W, b) = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{c} y_k^(i) log ŷ_k^(i)
Important note: When deriving the gradient expression for each weight vector w^(l), it is crucial to keep in mind that the weight vector for each class l ∈ {1, …, c} affects the outputs of the network for every class, not just for class l. This is due to the normalization in Equation 1 – if changing the weight vector increases the value of ŷ_l, then it necessarily must decrease the values of the other ŷ_{l'} for l' ≠ l. In this homework problem, please complete the derivation that is outlined below:
Derivation: For each weight vector w^(l), we can derive the gradient expression as:
∇_{w^(l)} f_CE(W, b) = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{c} y_k^(i) ∇_{w^(l)} log ŷ_k^(i) = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{c} (y_k^(i) / ŷ_k^(i)) ∇_{w^(l)} ŷ_k^(i)
We handle the two cases l = k and l ≠ k separately.
For l = k: ∇_{w^(l)} ŷ_k^(i) = [complete me…] = x^(i) ŷ_l^(i) (1 − ŷ_l^(i))
For l ≠ k: ∇_{w^(l)} ŷ_k^(i) = [complete me…] = −x^(i) ŷ_k^(i) ŷ_l^(i)
To compute the total gradient of f_CE w.r.t. each w^(l), we have to sum over all examples and over all c classes. (Hint: ∑_k a_k = a_l + ∑_{k≠l} a_k. Also, ∑_k y_k = 1.)
∇_{w^(l)} f_CE(W, b) = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{c} y_k^(i) ∇_{w^(l)} log ŷ_k^(i) = [complete me…] = −(1/n) ∑_{i=1}^{n} x^(i) (y_l^(i) − ŷ_l^(i))
Finally, show that ∇_b f_CE(W, b) = −(1/n) ∑_{i=1}^{n} (y^(i) − ŷ^(i)).
3. Implementation of softmax regression [25 points, in Python code]: Train a 2-layer softmax neural network to classify images of fashion items (10 different classes, such as shoes, t-shirts, dresses, etc.)
from the Fashion MNIST dataset. The input to the network will be a 28 × 28-pixel image (converted into a 784-dimensional vector); the output will be a vector of 10 probabilities (one for each class). The cross-entropy loss function that you minimize should be
f_CE(w^(1), …, w^(10), b^(1), …, b^(10)) = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{10} y_k^(i) log ŷ_k^(i) + (α/2) ∑_{k=1}^{10} w^(k)⊤ w^(k)
where n is the number of examples and α is a regularization constant. (Footnote: in this equation, the regularization term is not divided by n as in the lecture notes. Either equation is valid since the 1/n can be subsumed into α. Here, for simplicity, the 1/n is omitted.) Note that each ŷ_k implicitly depends on all the weights W = [w^(1), …, w^(10)] and biases b = [b^(1), …, b^(10)].
To get started, first download the Fashion MNIST dataset from the following web links:
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_train_images.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_train_labels.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_test_images.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_test_labels.npy
These files can be loaded into numpy using np.load. Each "labels" file consists of a 1-d array containing n labels (valued 0-9), and each "images" file contains a 2-d array of size n×784, where n is the number of images. Next, implement stochastic gradient descent (SGD) to minimize the cross-entropy loss function on this dataset. Regularize the weights but not the biases. Optimize the same hyperparameters as in homework 1 problem 2 (age regression). You should also use the same methodology as for the previous homework, including the splitting of the training files into validation and training portions.
Performance evaluation: Once you have tuned the hyperparameters and optimized the weights so as to maximize performance on the validation set, then: (1) stop training the network and (2) evaluate the network on the test set. Record the performance both in terms of (unregularized) cross-entropy loss (smaller is better) and percent correctly classified examples (larger is better); put this information into the PDF you submit. Hint 1: it accelerates training if you first normalize all the pixel values of both the training and testing data by dividing each pixel by 255. Hint 2: when using functions like np.sum and np.mean, make sure you know what the axis and keepdims parameters mean and that you use them in a way that is consistent with the math!
4. Logistic Regression [15 points, on paper]: Consider a 2-layer neural network that computes the function ŷ = σ(x⊤w + b), where x is an example, w is a vector of weights, b is a bias term, and σ is the logistic sigmoid function. Assume we train this network using the log loss, as described in class. Moreover, suppose all the training examples are positive. Answer the following questions about convergence. (Informally, a sequence of numbers converges if it gets closer and closer to a specific number as the sequence progresses. A sequence that does not converge can do different things, e.g., change erratically, or grow towards +/−∞.) While you are not required to give formal proofs, you should explain your reasoning, which could either be a mathematical argument or a simulation result. Put your answers into your PDF file.
(a) Given a well-chosen learning rate: what value will the training loss converge to during gradient descent?
(b) Given a well-chosen learning rate: will b always converge; does convergence depend on the exact training examples; or does it never converge?
(c) Suppose the training set contains exactly 2 examples, x^(1), x^(2) ∈ R^2. Give specific non-zero values for these training data such that:
i. w will converge during gradient descent (given a well-chosen learning rate).
ii. w will not converge during gradient descent (no matter what the learning rate).
Create a Zip file containing both your Python and PDF files, and then submit on Canvas. If you are working as part of a group, then only one member of your group should submit (but make sure you have already signed up in a pre-allocated team for the homework on Canvas). Please name your submission files using your names in the following format: ( ). For example: jacob_whitehill_yao_su.zip

1. Feed-forward neural network [60 points]: In this problem you will train a multi-layer neural network to classify images of fashion items (10 different classes) from the Fashion MNIST dataset. Similarly to Homework 3, the input to the network will be a 28 × 28-pixel image; the output will be a vector of 10 class probabilities. Specifically, the network you create should implement a function f : R^784 → R^10, where:
z^(1) = W^(1) x + b^(1)
h^(1) = relu(z^(1))
z^(2) = W^(2) h^(1) + b^(2)
…
z^(l) = W^(l) h^(l−1) + b^(l)
ŷ = softmax(z^(l))
The network specified above is shown in the figure below. [Figure of the multi-layer network omitted.]
As usual, the (unregularized) cross-entropy cost function should be
f_CE(W^(1), b^(1), …, W^(l), b^(l)) = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{10} y_k^(i) log ŷ_k^(i)
where n is the number of examples.
Hyperparameter tuning: In this problem, there are several different hyperparameters and architectural design decisions that will impact the network's performance:
• Number of hidden layers (suggestions: {3, 4, 5})
• Number of units in each hidden layer (suggestions: {30, 40, 50})
• Learning rate (suggestions: {0.001, 0.005, 0.01, 0.05, 0.1, 0.5})
• Minibatch size (suggestions: 16, 32, 64, 128, 256)
• Number of epochs
• L2 Regularization strength applied to the weight matrices (but not bias terms)
• Frequency & rate of learning rate decay
• Variance & type of random noise added to training examples for data augmentation
• …
These can all have a big impact on the test accuracy. In contrast to previous assignments, there is no specific requirement for how to optimize them. However, in practice it will be necessary to do so in order to get good results.
Numerical gradient check: To make sure that you are implementing your gradient expressions correctly, you should use check_grad (and possibly its sister function, approx_fprime). These methods take a function f (i.e., a Python method that computes a function; in practice, this will be the regularized cross-entropy function you code) as well as a set of points on which to compute f's derivative (some particular values for the weights and biases). approx_fprime will return the numerical estimate of the gradient of f, evaluated at the points you provided. check_grad also takes another parameter, ∇f, which is what you claim is the Python function that returns the gradient of the function f you passed in. check_grad computes the discrepancy (averaged over a set of points that you specified) between the numerical and analytical derivatives.
Both of these methods require that all the parameters of the function (in practice: the weights and biases of your neural network) are “packed” into a single vector (even though the parameters actually constitute both matrices and vectors). For this reason, the starter code we provide includes a method called unpack that takes a vector of numbers and extracts W^(1), W^(2), . . . , as well as b^(1), b^(2), . . . . Note that the training data and training labels are not parameters of the function f whose gradient you are computing/estimating, even though they are obviously needed by the cross-entropy function to do its job. For this reason, we “wrap” the call to fCE with a Python lambda expression in the starter code.
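The following is a minimal, self-contained sketch of the packing/wrapping mechanics described above, using a toy linear model rather than your actual network; the names unpack, f_ce, and grad_ce are illustrative stand-ins, not the starter code’s definitions.

# Sketch: verifying a gradient for "packed" parameters with scipy.optimize.check_grad.
import numpy as np
from scipy.optimize import check_grad

def unpack(theta, shapes):
    """Split a flat parameter vector back into arrays of the given shapes."""
    params, i = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        params.append(theta[i:i + size].reshape(shape))
        i += size
    return params

# Toy "network": one linear layer with a squared-error cost, just to show the mechanics.
def f_ce(theta, X, y, shapes):
    W, b = unpack(theta, shapes)
    yhat = X @ W + b
    return 0.5 * np.sum((yhat - y) ** 2) / X.shape[0]

def grad_ce(theta, X, y, shapes):
    W, b = unpack(theta, shapes)
    yhat = X @ W + b
    n = X.shape[0]
    gW = X.T @ (yhat - y) / n
    gb = np.mean(yhat - y, axis=0)
    return np.concatenate([gW.ravel(), gb.ravel()])   # re-pack the gradient

rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
shapes = [(3, 2), (2,)]
theta0 = rng.normal(size=3 * 2 + 2)

# Wrap the data in lambdas so check_grad only sees the parameter vector.
disc = check_grad(lambda th: f_ce(th, X, y, shapes),
                  lambda th: grad_ce(th, X, y, shapes), theta0)
print("discrepancy:", disc)   # should be small (around 1e-6 or less) for a correct gradient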
Your tasks:
(a) Implement stochastic gradient descent (SGD; see Section 5.9 and Algorithm 6.4 in the Deep Learning textbook, https://www.deeplearningbook.org/) for the multi-layer neural network shown above. Important: your backprop algorithm must work for any number of hidden layers.
(b) Verify that your gradient function is correct using the check_grad method. In particular, include in your PDF the real-valued output of the call to check_grad for a neural network with 3 hidden layers, each with 64 neurons (the discrepancy between the numerical and analytical derivative for this case should be less than 1e-4).
(c) Include a screenshot in your submitted PDF file showing multiple iterations of SGD (just to show that you actually ran your code successfully). For each iteration, report both the test accuracy and test unregularized cross-entropy. For full credit, the accuracy (percentage correctly classified test images) should be at least 88%.
(d) Visualize the first layer of weights W^(1) that are learned after training your best neural network. In particular, reshape each row of the weights matrix into a 28 × 28 matrix, and then create a “grid” of such images. Include this figure in your PDF. Here is an example (for W^(0) ∈ R^{64×784}).

Recommended strategy: First, (1) implement the forward- and back-propagation phases so that you can perfectly (as verified with check_grad) compute the gradient for just a single training example. You will likely (unless you do everything perfectly from the get-go) need to “break” the gradient vector, as returned by your gradCE and the approx_fprime methods, into its individual components (for the individual weight matrices and bias terms) so that you can compare each of them one-by-one and see where any problems lie. Next, (2) you can implement minibatches of size ñ > 1 by simply iterating over the minibatch in a for-loop. Finally – and only after you have correctly implemented step (2) – replace the for-loop (which is relatively slow) with matrix operations that compute the same result.

In addition to your Python code (homework3_WPIUSERNAME1.py or homework3_WPIUSERNAME1_WPIUSERNAME2.py for teams), create a PDF file (homework3_WPIUSERNAME1.pdf or homework3_WPIUSERNAME1_WPIUSERNAME2.pdf for teams) containing the screenshots described above.

1. Visualizing SGD Trajectories of Fully-Connected Neural Networks (FCNNs) [25 pts]: First, read all the “Introduction to PyTorch” tutorials (see https://pytorch.org/tutorials/beginner/basics/intro.html), including “Training a classifier”, “Learn the Basics”, “Quickstart”, “Tensors”, “Datasets & Dataloaders”, “Transforms”, “Build Model”, “Autograd”, “Optimization”, and “Save & Load Model”. Then complete the following tasks:

(a) [10 pts]: Use PyTorch to build and train a simple FCNN with at least 2 hidden layers to classify the Fashion MNIST dataset (similar to Homework 3). You can use label-preserving transformations such as rotation (make sure not to rotate the images by more than, say, ±10°) to improve generalization. Your network must be fully-connected – it cannot use convolution. Report the test accuracy you get in the PDF.

(b) [10 pts]: For any fixed FCNN architecture with at least 2 hidden layers, visualize the gradient descent trajectory in 3-D from two different random parameter initializations. In your plot, you should use the first two axes to represent different values for the NN’s parameters (which we will denote here collectively simply as p) and the third (vertical) axis to represent the cross-entropy fCE(p). Of course, there are far (!) more than just 2 parameters in the NN, and thus it will be necessary to perform dimensionality reduction. You should use principal component analysis (PCA) to reduce the parameter space down to just 2 dimensions. The two PCs will represent different “directions” along which the NN parameters can vary, where these directions are chosen to minimize the reconstruction error of the data. In general, it will not be the case that a PC corresponds to just a single weight/bias; rather, moving along each PC axis will correspond to changing all of the NN’s parameters at once. Concrete steps:
i. Run SGD at least two times to collect multiple trajectories of p. (Ask ChatGPT for help on how to extract all the parameters from the entire NN as a single vector.) For each value, save the corresponding training cost fCE(p) – you will need these later. To keep things tractable, train the network on just 1000 examples of Fashion MNIST.
ii. Use the collected p vectors to estimate the first 2 principal components that map from the full parameter space down to just 2 dimensions (see sklearn.decomposition.PCA).
iii. For each p that was encountered during training, project it into the 2-d space, and then plot it as part of a 3-d scatter plot with its associated fCE(p) value that was computed during SGD.
iv. To give a sense of the “landscape” through which SGD is “hiking”, compute a dense grid of at least 25×25 points in the 2-d PC space. For each point p̃ in this 2-d space, project it back into the full parameter space (see PCA.inverse_transform) to obtain a value for p. Load these parameters p into the NN (ask ChatGPT for help on how to do this). Finally, make a surface plot (use the plot_surface function in matplotlib) showing the corresponding cost values over all points in this grid. Render this surface plot in the same figure as the 3-d scatter plot.
(A minimal end-to-end sketch of steps i–iv, on stand-in data, appears after part (c) below.)
Make sure to compute the fCE values (for both the 3-d scatter plot and the surface plot) on the set of 1000 training data, since that is what SGD directly optimizes. Include your figure in the PDF you submit. An example figure is shown below.

[Figure: example 3-d plot of the reconstructed loss surface over the 2-d PC space, with two projected SGD trajectories.]

As you can see, the two SGD trajectories started on different sides of a ridge and ended up descending into different valleys.

(c) [3 pts]: In the figure you created in part (b), the fCE values of the points in the 3-d scatter plot (computed during SGD) do not always exactly equal the corresponding values from the surface plot generated on the dense grid of points. Why is that, and why would it be impractical to create a surface plot over the grid that exactly matches the “real” fCE values obtained using SGD? (Think about how PCA works and how it is used here.) Answer in a few sentences in the PDF.
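Below is a hedged, end-to-end sketch of steps i–iv on stand-in data (random tensors and a small FCNN), not the required architecture or any starter code; swap in the 1000 Fashion MNIST examples and your own network.

import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Stand-in data and model so the sketch runs end-to-end.
torch.manual_seed(0)
X = torch.randn(1000, 784)
Y = torch.randint(0, 10, (1000,))
net = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))
loss_fn = nn.CrossEntropyLoss()

def fce():
    with torch.no_grad():
        return loss_fn(net(X), Y).item()

# (i) Run SGD from two random initializations, saving the flattened parameters and their cost.
params_log, cost_log = [], []
for run in range(2):
    for m in net.modules():
        if isinstance(m, nn.Linear):
            m.reset_parameters()
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for step in range(100):
        opt.zero_grad()
        loss = loss_fn(net(X), Y)
        loss.backward()
        opt.step()
        params_log.append(parameters_to_vector(net.parameters()).detach().numpy().copy())
        cost_log.append(fce())

P = np.stack(params_log)
# (ii) Two principal components of the visited parameter vectors.
pca = PCA(n_components=2).fit(P)
P2 = pca.transform(P)

# (iv) Dense grid in PC space -> reconstruct full parameters -> load into net -> evaluate f_CE.
g1 = np.linspace(P2[:, 0].min(), P2[:, 0].max(), 25)
g2 = np.linspace(P2[:, 1].min(), P2[:, 1].max(), 25)
G1, G2 = np.meshgrid(g1, g2)
Z = np.zeros_like(G1)
for i in range(25):
    for j in range(25):
        p = pca.inverse_transform(np.array([[G1[i, j], G2[i, j]]]))[0]
        vector_to_parameters(torch.tensor(p, dtype=torch.float32), net.parameters())
        Z[i, j] = fce()

# (iii) 3-d scatter of the projected SGD points over the surface plot.
ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(G1, G2, Z, alpha=0.5)
ax.scatter(P2[:, 0], P2[:, 1], cost_log, c="red", s=8)
ax.set_xlabel("PC 1"); ax.set_ylabel("PC 2"); ax.set_zlabel("f_CE")
plt.show()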
(d) [2 pts]: Assume that the set of all the p vectors you collected has zero mean (i.e., the sum of all the p vectors equals the zero vector). Let p, p′ represent two different configurations of the NN’s parameters, and let p̃, p̃′ ∈ R² represent their respective projections in the 2-d PC space. Furthermore, let p̂, p̂′ represent the reconstructions (using PCA.inverse_transform) of the NN’s parameters from p̃, p̃′. Which of the following statements are always true? Report the correct statements in your PDF.
i. If p = 2p′ (i.e., the NN parameters in the first configuration are all twice the magnitude of the corresponding parameters in the second configuration), then p̃ = 2p̃′.
ii. If p̃ = 2p̃′, then p = 2p′.
iii. If p̃ = 2p̃′, then p̂ = 2p̂′.
iv. fCE(p̂) ≤ fCE(p).

2. Simple CNN for Fashion MNIST [15 pts]: Read the following PyTorch tutorial: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. Then, apply the same methodology to the Fashion MNIST dataset (see the torchvision.datasets.FashionMNIST class). Note that, since Fashion MNIST images are grayscale, they have just a single color channel rather than 3. Hence, you will need to adapt the CNN slightly so that the number of input channels in the first convolutional layer is 1 instead of 3. Also, the normalization step transforms.Normalize will use just 1 channel instead of 3. Finally, as the image size is different compared to CIFAR10, the size of the feature maps will also be different. You will thus need to update the number of incoming neurons to the first fully-connected (nn.Linear) layer accordingly. After making these modifications, train and evaluate a classifier on this dataset. Report the test accuracy in the PDF you submit.
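One plausible adaptation of the tutorial’s CNN is sketched below; the layer widths are the tutorial’s defaults and the only substantive changes are the single input channel and the resized first nn.Linear layer (28 → 24 → 12 → 8 → 4 after the two conv/pool stages, hence 16·4·4 incoming features). This is a sketch of the idea, not a prescribed architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FashionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)        # 1 grayscale input channel instead of 3
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)  # 16*4*4 for 28x28 inputs, not 16*5*5 as for CIFAR10
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

# Quick shape check on a dummy batch of 28x28 grayscale images.
print(FashionNet()(torch.zeros(4, 1, 28, 28)).shape)   # torch.Size([4, 10])

Correspondingly, the normalization transform would use one channel, e.g. transforms.Normalize((0.5,), (0.5,)).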
3. Supervised Pre-Training and Fine-Tuning of CNNs [15 pts]: Read the following PyTorch tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html. Then, apply the same methodology to the Fashion MNIST dataset. Report the test accuracies in the PDF you submit, using either (a) fine-tuning of the whole model or (b) training just the final (classification) layer.

4. CNNs for Behavioral Cloning in Pong [20 pts]: Train an AI agent to play Pong (see https://www.gymlibrary.dev/environments/atari/pong/). In this game, each player can execute one of 6 possible actions at each timestep (NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, and LEFTFIRE). The goal is to execute the best action at each timestep based on the current state of the game. To get started, first download the following files:
• https://s3.amazonaws.com/jrwprojects/pong_actions.pt
• https://s3.amazonaws.com/jrwprojects/pong_observations.pt
Together, these files define (image, action) pairs generated by an expert player from the Atari Pong video game. Using PyTorch, implement and train any NN architecture you choose (I recommend a simple CNN) to map the images to their corresponding expert actions. Your NN will implement the control policy that dictates how the agent behaves in different situations, and the approach of training this NN in a supervised manner from expert trajectories is called behavioral cloning. After training your NN, save it to a file (use torch.save). Then, load the model in play_atari.py and see how well your AI “player” does against the computer. To receive full credit, your trained agent should be able to beat the computer (i.e., reach 21 points first). (Note: for this part of the homework, you need to use a standard Python interpreter – Google Colab cannot render the frames of the video game.)

In addition to your Python code (homework4_WPIUSERNAME1.py or homework4_WPIUSERNAME1_WPIUSERNAME2.py for teams), create a PDF file (homework4_WPIUSERNAME1.pdf or homework4_WPIUSERNAME1_WPIUSERNAME2.pdf for teams) containing the screenshots described above.
1. A linear NN will never solve the XOR problem [10 points, on paper]: Read the description of the XOR problem from Section 6.1 of the Deep Learning textbook: https://www.deeplearningbook.org/contents/mlp.html. Assume that you wish to classify the 4 points in this dataset using a 2-layer linear NN (i.e., the same model as in Homework 1, Problem 2). Then show (by deriving the gradient, setting it to 0, and solving mathematically, not in Python) that the values for w = [w1, w2]^⊤ and b that minimize the function f_MSE(w, b) in Equation 6.1 are: w1 = 0, w2 = 0, and b = 0.5 – in other words, the best prediction line is simply flat and always guesses ŷ = 0.5.

2. Derivation of softmax regression gradient updates [20 points, on paper]: As explained in class, let W = [w^(1) · · · w^(c)] be an m × c matrix containing the weight vectors from the c different classes. The output of the softmax regression neural network is a vector with c dimensions such that:

ŷ_k = exp(z_k) / Σ_{k′=1}^{c} exp(z_{k′})     (1)
z_k = x^⊤ w^(k) + b_k

for each k = 1, . . . , c. Correspondingly, our cost function will sum over all c classes:

f_CE(W, b) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{c} y_k^(i) log ŷ_k^(i)

Important note: When deriving the gradient expression for each weight vector w^(l), it is crucial to keep in mind that the weight vector for each class l ∈ {1, . . . , c} affects the outputs of the network for every class, not just for class l. This is due to the normalization in Equation 1 – if changing the weight vector increases the value of ŷ_l, then it necessarily must decrease the values of the other ŷ_{l′≠l}. In this homework problem, please complete the following derivation that is outlined below:

Derivation: For each weight vector w^(l), we can derive the gradient expression as:

∇_{w^(l)} f_CE(W, b) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{c} y_k^(i) ∇_{w^(l)} log ŷ_k^(i) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{c} y_k^(i) (∇_{w^(l)} ŷ_k^(i)) / ŷ_k^(i)

We handle the two cases l = k and l ≠ k separately. For l = k:

∇_{w^(l)} ŷ_k^(i) = complete me . . . = x^(i) ŷ_l^(i) (1 − ŷ_l^(i))

For l ≠ k:

∇_{w^(l)} ŷ_k^(i) = complete me . . . = −x^(i) ŷ_k^(i) ŷ_l^(i)

To compute the total gradient of f_CE w.r.t. each w^(l), we have to sum over all examples and over k = 1, . . . , c. (Hint: Σ_k a_k = a_l + Σ_{k≠l} a_k. Also, Σ_k y_k = 1.)

∇_{w^(l)} f_CE(W, b) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{c} y_k^(i) ∇_{w^(l)} log ŷ_k^(i) = complete me . . . = −(1/n) Σ_{i=1}^{n} x^(i) (y_l^(i) − ŷ_l^(i))

Finally, show that

∇_b f_CE(W, b) = −(1/n) Σ_{i=1}^{n} (y^(i) − ŷ^(i))

3. Implementation of softmax regression [25 points, in Python code]: Train a 2-layer softmax neural network to classify images of fashion items (10 different classes, such as shoes, t-shirts, dresses, etc.) from the Fashion MNIST dataset. The input to the network will be a 28 × 28-pixel image (converted into a 784-dimensional vector); the output will be a vector of 10 probabilities (one for each class). The cross-entropy loss function¹ that you minimize should be

f_CE(w^(1), . . . , w^(10), b^(1), . . . , b^(10)) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{10} y_k^(i) log ŷ_k^(i) + (α/2) Σ_{k=1}^{c} w^(k)⊤ w^(k)

where n is the number of examples and α is a regularization constant. Note that each ŷ_k implicitly depends on all the weights W = [w^(1), . . . , w^(10)] and biases b = [b^(1), . . . , b^(10)].
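To make the derived expressions concrete, here is a minimal numpy sketch of the softmax forward pass, the regularized cost, and the gradient formulas above, run on random data; it is not the required code structure for your submission, and the variable names and shapes are illustrative only.

import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def forward(X, W, b):
    return softmax(X @ W + b)                     # n x 10 matrix of probabilities y-hat

def f_ce(X, Y, W, b, alpha):
    Yhat = forward(X, W, b)
    ce = -np.mean(np.sum(Y * np.log(Yhat), axis=1))
    return ce + 0.5 * alpha * np.sum(W * W)       # regularize the weights, not the biases

def grad_ce(X, Y, W, b, alpha):
    n = X.shape[0]
    Yhat = forward(X, W, b)
    gW = X.T @ (Yhat - Y) / n + alpha * W         # matches -(1/n) sum x (y_l - yhat_l), plus alpha*w
    gb = np.mean(Yhat - Y, axis=0)                # matches -(1/n) sum (y - yhat)
    return gW, gb

# Tiny sanity check on random data (X: n x 784, Y: n x 10 one-hot).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 784))
Y = np.eye(10)[rng.integers(0, 10, size=8)]
W, b = 0.01 * rng.normal(size=(784, 10)), np.zeros(10)
print(f_ce(X, Y, W, b, alpha=0.01), grad_ce(X, Y, W, b, alpha=0.01)[0].shape)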
To get started, first download the Fashion MNIST dataset from the following web links:
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_train_images.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_train_labels.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_test_images.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_test_labels.npy
These files can be loaded into numpy using np.load. Each “labels” file consists of a 1-d array containing n labels (valued 0-9), and each “images” file contains a 2-d array of size n × 784, where n is the number of images. Next, implement stochastic gradient descent (SGD) to minimize the cross-entropy loss function on this dataset. Regularize the weights but not the biases. Optimize the same hyperparameters as in homework 1 problem 2 (age regression). You should also use the same methodology as for the previous homework, including the splitting of the training files into validation and training portions.

¹ In this equation, the regularization term is not divided by n like in the lecture notes. Either equation is valid since the 1/n can be subsumed into α. Here, for simplicity, the 1/n is omitted.

Performance evaluation: Once you have tuned the hyperparameters and optimized the weights so as to maximize performance on the validation set, then: (1) stop training the network and (2) evaluate the network on the test set. Record the performance both in terms of (unregularized) cross-entropy loss (smaller is better) and percent correctly classified examples (larger is better); put this information into the PDF you submit.

Hint 1: it accelerates training if you first normalize all the pixel values of both the training and testing data by dividing each pixel by 255.
Hint 2: when using functions like np.sum and np.mean, make sure you know what the axis and keepdims parameters mean and that you use them in a way that is consistent with the math!

4. Logistic Regression [15 points, on paper]: Consider a 2-layer neural network that computes the function ŷ = σ(x^⊤ w + b), where x is an example, w is a vector of weights, b is a bias term, and σ is the logistic sigmoid function. Assume we train this network using the log loss, as described in class. Moreover, suppose all the training examples are positive. Answer the following questions about convergence. (Informally, a sequence of numbers converges if it gets closer and closer to a specific number as the sequence progresses. A sequence that does not converge can do different things, e.g., change erratically, or grow towards +/−∞.) While you are not required to give formal proofs, you should explain your reasoning, which could either be a mathematical argument or a simulation result. Put your answers into your PDF file.
(a) Given a well-chosen learning rate: what value will the training loss converge to during gradient descent?
(b) Given a well-chosen learning rate: will b always converge; does convergence depend on the exact training examples; or does it never converge?
(c) Suppose the training set contains exactly 2 examples, x^(1), x^(2) ∈ R². Give specific non-zero values for these training data such that:
i. w will converge during gradient descent (given a well-chosen learning rate).
ii. w will not converge during gradient descent (no matter what the learning rate).

Create a Zip file containing both your Python and PDF files, and then submit on Canvas.
If you are working as part of a group, then only one member of your group should submit (but make sure you have already signed up in a pre-allocated team for the homework on Canvas). Please name your submission files using your names in the following format: ( ). For example: jacob whitehill yao su.zip
A robot is wandering around a room with some obstacles, labeled as # in the grid below. It can occupy any of the free cells labeled with a letter, but we are uncertain about its true location and thus keep a belief distribution over its current location. At each timestep it moves from its current cell to a neighboring free cell in one of the four cardinal directions with uniform probability; it cannot stay in the same cell. For example, from A the robot can move to either B or C with probability 1/2, while from D it can move to B, C, E, or F, each with probability 1/4.

A B #
C D E
# F #

The robot may also make an observation after each transition, returning what it sees in a random cardinal direction. Possibilities include observing #, “wall”, or “empty” (for a free cell). For example, in B the robot observes “wall”, # (each with probability 1/4), or “empty” (probability 1/2).

(a) Suppose the robot wanders around forever without making any observations. What is the stationary distribution π over the robot’s predicted location? Hint: You can use numpy.linalg.eig in Python. The first return value is a 1D array of eigenvalues; the second return value is a 2D array, where each column is a corresponding eigenvector. Remember that eigenvectors may not sum to 1 by default. (A generic numpy recipe for this hint is sketched after the Bayes net questions below.)
(b) Now suppose that we know that the robot is in state D; i.e., Pr(X0 = D) = 1. Starting from this state, the robot makes one transition and observes e1 = #. What is the updated belief distribution Pr(X1 | e1)?
(c) The robot makes a second transition and observes e2 = “empty”. What is the updated belief distribution Pr(X2 | e1, e2)?
(d) Compute the joint distribution Pr(X1, X2 | e1, e2). You do not need to explicitly list the values that have probability 0. What is the most likely state sequence(s)?
(e) Compute b = Pr(e2 | X1). Briefly explain what this quantity represents.
(f) Compute the smoothed distribution Pr(X1 | e1, e2) by multiplying f = Pr(X1 | e1) with b = Pr(e2 | X1) and normalizing the result. Confirm that the distribution is the same as that obtained from marginalization of Pr(X1, X2 | e1, e2).

We will investigate the absence of conditional independence guarantees between two random variables when an arbitrary descendant of a common effect is observed. We will consider the simple case of a causal chain of descendants:

[Figure: A and B are parents of the common effect D0, which begins a chain D0 → D1 → · · · → Dn.]

Suppose that all random variables are binary. The marginal distributions of A and B are both uniform (0.5, 0.5), and the CPTs of the common effect D0 and its descendants are as follows:

A     B     Pr(+d0 | A, B)
+a    +b    1.0
+a    −b    0.5
−a    +b    0.5
−a    −b    0.0

D_{i−1}    Pr(+d_i | D_{i−1})
+d_{i−1}   1.0
−d_{i−1}   0.0

(a) Give an analytical expression for the joint distribution Pr(D0, D1, · · · , Dn). Your expression should only contain CPTs from the Bayes net parameters. What is the size of the full joint distribution, and how many entries are nonzero?
(b) Suppose we observe Dn = +dn. Numerically compute the CPT Pr(+dn | D0). Please show how you can solve for it using the joint distribution in (a), even if you do not actually use it.
(c) Let’s turn our attention to A and B. Give a minimal analytical expression for Pr(A, B, D0, +dn). Your expression should only contain CPTs from the Bayes net parameters or the CPT you found in part (b) above.
(d) Lastly, compute Pr(A, B | +dn). Show that A and B are not independent conditioned on Dn.
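The hint in part (a) can be turned into a short, generic numpy recipe. The sketch below uses a small example chain, not the robot’s actual transition matrix, so it illustrates the technique without giving away the answer.

import numpy as np

# Example column-stochastic transition matrix P, where P[j, i] = Pr(next = j | current = i).
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

vals, vecs = np.linalg.eig(P)
k = np.argmin(np.abs(vals - 1.0))      # index of the eigenvalue closest to 1
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                     # eigenvectors are not normalized to sum to 1
print(pi)                              # stationary distribution over the states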
In this problem you will explore part-of-speech (POS) tagging, a standard task in natural language processing. The goal is to identify parts of speech and related labels for each word in a given corpus. Hidden Markov models are well suited for this problem, with parts of speech being hidden states and the words themselves being observations. We will be using data from the English EWT treebank from Universal Dependencies, which uses 17 POS tags. We are providing clean versions of training and test data for you. The data format is such that each line contains a word and associated tag, and an empty line signifies the end of a sentence. Feel free to open the files in a text editor to get an idea.

The provided Python file contains a couple of functions that can be used to read a data file (you do not need to call these yourself). The global variable POS contains a list of all possible parts of speech. You will be filling in the remaining functions in the file and running the code in main where instructed.

3.1: Supervised Learning (12 points)
Your first task is to learn the three sets of HMM parameters from the training data. The initial distribution Pr(X0) will be stored in a 1D array of size 17. The transition probabilities will be stored in a 2D array of size 17 × 17. The observation probabilities will be stored in a dictionary, where each key is a word and the value is a 1D array (size 17) of probabilities Pr(word | POS). These probabilities should follow the same order as in the POS list. Implement learn_model so that it compiles and returns these three structures. The data input is a list of sentences, each of which is a list of (word, POS) pairs. Your method should iterate over each sentence in the training data, counting the POS appearances in the first word, the number of POS to POS transitions, and the number of POS to word observations. Treat each sentence independently; do not count transitions between different sentences. Be sure to correctly normalize all of these distributions. Make sure that the quantities Σ_i Pr(X0)_i, Σ_i Pr(X_{t+1} = i | X_t = j), and Σ_k Pr(E_t = k | X_t) are all equal to 1.

3.2: Viterbi Algorithm (12 points)
The next task is to implement the Viterbi algorithm to predict the most likely sequence of states given a sequence of observations. We will break this into two pieces: viterbi_forward and viterbi_backward. viterbi_forward takes in four parameters: initial distribution X0 (1D array), transition matrix Tprobs (2D array), observation probabilities Oprobs (dictionary), and observation sequence obs (list of word observations) of length T. viterbi_forward should compute and return two quantities: max_{x_1,…,x_{T−1}} Pr(x_1, …, x_{T−1}, X_T, e_{1:T}) as a 1D array of size 17, and a T × 17 2D array of pointer indices. Row i of the pointer array should contain argmax_{x_{i−1}} Pr(x_1, …, x_{i−1}, X_i, e_{1:i}). For simplicity, pointers will be the indices of the POS in the POS array rather than strings. Note that it is possible for an observation e to not exist in the Oprobs dictionary if it was not present when the model was trained. If this occurs, you may simply take Pr(e | x) = 1 for all POS x; this is equivalent to skipping the observation step. viterbi_backward takes the two quantities returned by viterbi_forward as parameters. It should start with the most likely POS according to max_{x_1,…,x_{T−1}} Pr(x_1, …, x_{T−1}, X_T, e_{1:T}) and then follow the pointers backward to reconstruct the entire POS sequence argmax_{x_1,…,x_T} Pr(x_1, …, x_T | e_{1:T}). The returned object should be a list of POS (strings) from state 1 to state T. Note that the predicted state for X0 should not be included, so your list should be length T.
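As a hedged illustration of the two pieces described in 3.2 (not the starter-code signatures — for instance, the POS list is passed explicitly here, and the tiny demo data are made up), a compact numpy version might look like this:

import numpy as np

def viterbi_forward(X0, Tprobs, Oprobs, obs):
    n_states = len(X0)
    m = X0.copy()                                   # current max joint probability per state
    pointers = np.zeros((len(obs), n_states), dtype=int)
    for t, word in enumerate(obs):
        scores = m[:, None] * Tprobs                # scores[j, i] = m[j] * Pr(next = i | current = j)
        pointers[t] = np.argmax(scores, axis=0)     # best previous state for each current state
        m = np.max(scores, axis=0)
        if word in Oprobs:                          # unseen word: skip the observation step
            m = m * Oprobs[word]
    return m, pointers

def viterbi_backward(m, pointers, POS):
    seq = [int(np.argmax(m))]                       # most likely final state
    for row in reversed(pointers[1:]):              # follow pointers back to state 1 (X0 excluded)
        seq.append(int(row[seq[-1]]))
    return [POS[i] for i in reversed(seq)]

# Tiny demo with 2 states and 2 observed words.
POS = ["N", "V"]
X0 = np.array([0.6, 0.4])
Tprobs = np.array([[0.7, 0.3], [0.4, 0.6]])
Oprobs = {"dog": np.array([0.9, 0.1]), "runs": np.array([0.2, 0.8])}
m, ptr = viterbi_forward(X0, Tprobs, Oprobs, ["dog", "runs"])
print(viterbi_backward(m, ptr, POS))   # ['N', 'V']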
3.3: Model Evaluation (8 points)
Your Viterbi implementation can now be used for prediction. evaluate_viterbi takes in the HMM parameters, along with a data list in the same format as in learn_model. Complete the function so that it runs Viterbi on each sentence separately on the data set. Then compare all returned POS predictions with the true POS in data. Compute and return the accuracy rate as the proportion of correct predictions (this should be a number between 0 and 1). After implementing this function, run the associated code in main and answer the following questions.
(a) Report the accuracies of your Viterbi implementation on each data set. Why is the accuracy on the test data set lower than that of the training set?
(b) Why can we not expect 100% accuracy on the training set, despite defining the HMM parameters to maximize the likelihood on the training data set?

3.4: Forward and Backward Algorithms (12 points)
Next, implement the forward, backward, and forward-backward algorithms (ideally fewer than 5 lines of code each). forward takes in the same four parameters as viterbi_forward above, and it computes and returns Pr(X_k, e_{1:k}) (no normalization). backward also takes in four parameters, dropping X0 but newly including the state index k. It should compute and return Pr(e_{k+1:T} | X_k). Note that k follows Python indexing; in other words, k=0 corresponds to X_1 and backward should return Pr(e_{2:T} | X_1). Again, to deal with the scenario in which a word is not in the observation model, you should explicitly check when this occurs and if so simply use Pr(e | x) = 1. Once you have both forward and backward, it should be straightforward to call these to implement forward_backward, which computes the smoothed state distribution Pr(X_k | e_{1:T}). Note that the call to forward should only use the observations up to (and including) the kth one. Don’t forget to normalize the smoothed distribution.

3.5: Inference Comparisons (6 points)
Now that you have all of these inference algorithms implemented, come up with a short English phrase p, ideally around 10 words or fewer. You should then identify a word w within this phrase, such that (i) the most likely POS of w according to the result of the forward algorithm on the first portion of p up to word w, and (ii) the most likely POS of w according to the result of the forward-backward algorithm on the entirety of p, are different. Leave the code that you used to obtain these observations in the main function of the code file. In your writeup, give the phrase that you used and the word that you evaluated. Indicate the POS returned by each of the results as described above. Give a brief explanation about why each of the methods returns something different.

Submission
You should have one PDF document containing your solutions for problems 1-2, as well as your responses to 3.3 and 3.5. You should also have a completed Python file implementing problem 3; make sure that all provided function headers and the filename are unchanged. Submit the document and .py code file to the respective assignment bins on Gradescope. For full credit, you must tag your pages for each given problem on the former.
1. Python and Numpy Warm-up Exercises [20 pts]: This part of the homework is intended to help you practice your linear algebra and how to implement linear algebraic and statistical operations in Python using numpy (to which we refer in the code below as np). For each of the problems below, write a method (e.g., problem1a) that returns the answer for the corresponding problem. In all problems, you may assume that the dimensions of the matrices and/or vectors that are given as input are compatible for the requested mathematical operations. You do not need to perform error-checking.

Note 1: In mathematical notation we usually start indices with j = 1. However, in numpy (and many other programming settings), it is more natural to use 0-based array indexing. When answering the questions below, do not worry about “translating” from 1-based to 0-based indexes. For example, if the (i, j)th element of some matrix is requested, you can simply write A[i,j].
Note 2: To represent and manipulate vectors and matrices, please use numpy’s array class (not the matrix class).
Note 3: While the difference between a row vector and a column vector is important when doing math, numpy does not care about this difference as long as the array is 1-D. This means, for example, that if you want to compute the inner product between two vectors x and y, you can just write x.dot(y) without needing to transpose the x. If x and y are 2-D arrays, however, then it does matter whether they are row-vectors or column-vectors, and hence you might need to transpose accordingly.

(a) Given two matrices A and B, compute and return an expression for A + B. [ 0 pts ]
Answer: While it is completely valid to use np.add(A, B), this is unnecessarily verbose; you really should make use of the “syntactic sugar” provided by Python’s/numpy’s operator overloading and just write: A + B. Similarly, you should use the more compact (and arguably more elegant) notation for the rest of the questions as well.
(b) Given matrices A, B, and C, compute and return AB − C (i.e., right-multiply matrix A by matrix B, and then subtract C). Use dot or np.dot. [ 1 pts ]
(c) Given matrices A, B, and C, return A⊙B + C^⊤, where ⊙ represents the element-wise (Hadamard) product and ⊤ represents matrix transpose. In numpy, the element-wise product is obtained simply with *. [ 1 pts ]
(d) Given column vectors x and y, compute the inner product of x and y (i.e., x^⊤y). [ 1 pts ]
(e) Given matrix A and integer i, return the sum of all the entries in the ith row whose column index is even, i.e., Σ_{j : j is even} A_{ij}. Do not use a loop, which in Python can be very slow. Instead use the np.sum function. [ 2 pts ]
(f) Given matrix A and scalars c, d, compute the arithmetic mean over all entries of A that are between c and d (inclusive). In other words, if S = {(i, j) : c ≤ A_{ij} ≤ d}, then compute (1/|S|) Σ_{(i,j)∈S} A_{ij}. Use np.nonzero along with np.mean. [ 2 pts ]
(g) Given an (n×n) matrix A and integer k, return an (n×k) matrix containing the right-eigenvectors of A corresponding to the k eigenvalues of A with the largest absolute value. Use np.linalg.eig. [ 2 pts ]
(h) Given a column vector (with n components) x, an integer k, and positive scalars m, s, return an (n × k) matrix, each of whose columns is a sample from the multidimensional Gaussian distribution N(x + mz, sI), where z is a column vector (with n components) containing all ones and I is the identity matrix. Use either np.random.multivariate_normal or np.random.randn. [ 2 pts ]
(i) Given a matrix A with n rows, return a matrix that results from randomly permuting the columns (but not the rows) in A. [ 2 pts ]
(j) Z-scoring: Given a vector x, return a vector y such that each y_i = (x_i − x̄)/σ, where x̄ is the mean (use np.mean) of the elements of x and σ is the standard deviation (use np.std). [ 2 pts ]
(k) Given an n-vector x and a non-negative integer k, return an n × k matrix consisting of k copies of x. You can use numpy methods such as np.newaxis, np.atleast_2d, and/or np.repeat. [ 2 pts ]
(l) Given a k × n matrix X = [x^(1) · · · x^(n)] and a k × m matrix Y = [y^(1) · · · y^(m)], compute an n × m matrix D = [d_{ij}] consisting of all pairwise L2 distances d_{ij} = ∥x^(i) − y^(j)∥₂. In this problem you may not use loops. Instead, you can avail yourself of numpy objects & methods such as np.newaxis, np.atleast_3d, np.repeat, np.swapaxes, etc. (There are various ways of solving it.) Hint: from X (resp. Y), construct a 3-d matrix that contains multiple copies of each of the vectors in X (resp. Y); then subtract these 3-d matrices. [ 3 pts ]
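For item (l), one broadcasting-based approach (a sketch, not the only accepted solution) is:

import numpy as np

def pairwise_l2(X, Y):
    # X[:, :, np.newaxis] has shape (k, n, 1); Y[:, np.newaxis, :] has shape (k, 1, m).
    diff = X[:, :, np.newaxis] - Y[:, np.newaxis, :]      # shape (k, n, m)
    return np.sqrt(np.sum(diff ** 2, axis=0))             # D[i, j] = ||x_i - y_j||_2

# Quick check against a direct computation on small random data.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(4, 5)), rng.normal(size=(4, 3))
D = pairwise_l2(X, Y)
assert np.allclose(D[2, 1], np.linalg.norm(X[:, 2] - Y[:, 1]))
print(D.shape)   # (5, 3)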
2. Training 2-Layer Linear Neural Networks with Stochastic Gradient Descent [25 pts]:
(a) Train an age regressor that analyzes a (48 × 48 = 2304)-pixel grayscale face image and outputs a real number ŷ that estimates how old the person is (in years). The training and testing data are available here:
• https://s3.amazonaws.com/jrwprojects/age_regression_Xtr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_ytr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_Xte.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_yte.npy
Your prediction model g should be a 2-layer linear neural network that computes ŷ = g(x; w) = x^⊤ w + b, where w is the vector of weights and b is the bias term. The cost function you should optimize is

f_MSE(w, b) = (1/(2n)) Σ_{i=1}^{n} (ŷ^(i) − y^(i))²

where n is the number of examples in the training set Dtr = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, each x^(i) ∈ R^2304 and each y^(i) ∈ R. To optimize the weights, you should implement stochastic gradient descent (SGD). Note: you must complete this problem using only linear algebraic operations in numpy – you may not use any off-the-shelf linear regression or neural network software, as that would defeat the purpose. There are several different hyperparameters that you will need to optimize:
• Mini-batch size ñ.
• Learning rate ϵ.
• Number of epochs.
In order not to cheat (in the machine learning sense) – and thus overestimate the performance of the network – it is crucial to optimize the hyperparameters only on a validation set. (The training set would also be acceptable but typically leads to worse performance.) To create a validation set, simply set aside a fraction (e.g., 20%) of age_regression_Xtr.npy and age_regression_ytr.npy to be the validation set; the remainder (80%) of these data files will constitute the “actual” training data. While there are fancier strategies (e.g., Bayesian optimization) that can be used for hyperparameter optimization, it’s often effective to just use a grid search over a few values for each hyperparameter. In this problem, you are required to explore systematically (e.g., using nested for loops) at least 2 different values for each hyperparameter. Performance evaluation: Once you have tuned the hyperparameters and optimized the weights and bias term so as to minimize the cost on the validation set, then: (1) stop training the network and (2) evaluate the network on the test set. Report both the training and the test fMSE in the PDF document you submit, as well as the training cost values for the last 10 iterations of gradient descent (just to show that you actually executed your code).

3. Gradient descent: what can go wrong? [30 pts] Please enter your code in a file named homework1_problem3.py, with one Python function (e.g., problem3a) for each subproblem.

(a) [10 pts]: The graph below plots the function f(x) that is defined piece-wise as:

f(x) =
  −x³                                 : x < −0.1
  −3x/100 − 1/500                     : −0.1 ≤ x < 3
  −(x − 31/10)³ − 23/250              : 3 ≤ x < 5
  (1083/200)(x − 6)² − 6183/500       : x ≥ 5

[Figure: plot of f(x) over roughly x ∈ [−4, 10], showing a long, nearly flat section just before the minimum.]

As you can see, the function has a long nearly flat section (sometimes known as a plateau) just before the minimum.¹ Plateaux can cause big problems during optimization. To show this:
i. Derive by hand the (piece-wise) function ∇f and implement it in Python/numpy.
ii. Use your implementation of ∇f to conduct gradient descent for T = 100 iterations. Always start from an initial x = −3. Try using various learning rates: 1e-3, 1e-2, 1e-1, 1e0, 1e1. Plot f, ∇f, as well as superimposed dots that show the sequence ((x^(1), y^(1)), . . . , (x^(T), y^(T))) of gradient descent. Use plt.legend to indicate which scatter plot corresponds to which learning rate. (A bare-bones descent-and-plot loop, with the gradient left as a stub, is sketched after part (c) below.)
iii. Describe in 1-2 sentences what you observe during gradient descent for the set of learning rates listed above.
iv. Find a learning rate ϵ for which gradient descent successfully converges to min f(x), and report ϵ in the PDF file.

¹ The ugly constants in f were chosen to give rise to these characteristics while ensuring that it remains differentiable.

(b) [8 pts]: Even a convex paraboloid – i.e., a parabola in multiple dimensions that has only one local minimum and no plateaux – can cause problems for “vanilla” SGD (i.e., the kind we’ve learned in class so far). Examine the scatter-plot below which shows the sequence ((x^(1), y^(1)), . . . , (x^(T), y^(T))) of gradient descent on a convex paraboloid f, starting at x^(1) = [1, −3]^⊤, where each x ∈ R². The descent produces a zig-zag pattern that takes a long time to converge to the local minimum.

[Figure: scatter plot of the zig-zagging descent sequence in the (x1, x2) plane.]

i. Speculate how the SGD trajectory would look if the learning rate were made to be very small (e.g., 100x smaller than in the figure above).
ii. Let f(x1, x2) = a1(x1 − c1)² + a2(x2 − c2)². Pick values for a1, a2, c1, c2 so that – when the scatter-plot shown above is superimposed onto it – the gradient descent is realistic for f. Rather than just guess randomly, consider: why would the zig-zag be stronger in one dimension than another, and how would this be reflected in the function’s constants? Plot a contour graph using plt.contour and superimpose the scatter-plot using plt.scatter. You can find the gradient descent sequence in gradient_descent_sequence.txt. Note that you are not required to find the exactly correct constants or even to estimate them algorithmically. Rather, you can combine mathematical intuition with some trial-and-error. Your solution should just look visually “pretty close” to get full credit. Note: to ensure proper rendering, use plt.axis('equal') right before calling plt.show().

(c) [6 pts]: This problem is inspired by this paper. Consider the function f(x) = (2/3)|x|^{3/2}. Derive ∇f, implement gradient descent and plot the descent trajectories ((x^(1), y^(1)), . . . , (x^(T), y^(T))) for a variety of learning rates 1e-3, 1e-2, 1e-1 and a variety of starting points. See what trend emerges, and report it in the PDF.
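As referenced in part (a)(ii), here is a bare-bones descent-and-plot loop. The function f is transcribed from the statement above; the gradient is approximated numerically only so that the sketch runs end-to-end – the assignment asks you to derive ∇f by hand and implement it exactly – and the clipping is just to keep the demo numerically tame at the largest learning rates.

import numpy as np
import matplotlib.pyplot as plt

def f(x):
    if x < -0.1:
        return -x**3
    elif x < 3:
        return -3*x/100 - 1/500
    elif x < 5:
        return -(x - 31/10)**3 - 23/250
    else:
        return (1083/200)*(x - 6)**2 - 6183/500

def grad_f(x, h=1e-6):
    # Placeholder numerical derivative so the sketch runs; replace with your hand-derived
    # piece-wise expression for the actual assignment.
    return (f(x + h) - f(x - h)) / (2 * h)

def descend(lr, x0=-3.0, T=100):
    xs = [x0]
    for _ in range(T - 1):
        step = xs[-1] - lr * grad_f(xs[-1])
        xs.append(float(np.clip(step, -1e6, 1e6)))   # clip only to keep the demo finite
    return np.array(xs)

xs_plot = np.linspace(-4, 10, 500)
plt.plot(xs_plot, [f(x) for x in xs_plot], label="f")
for lr in [1e-3, 1e-2, 1e-1, 1e0, 1e1]:
    traj = descend(lr)
    plt.scatter(traj, [f(x) for x in traj], s=10, label=f"lr={lr}")
plt.legend()
plt.show()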
(d) [6 pts]: While very (!) unlikely, it is theoretically possible for gradient descent to converge to a local maximum. Give the formula of a function (in my own solution, I used a degree-4 polynomial), a starting point, and a learning rate such that gradient descent converges to a local maximum after 1 descent iteration (i.e., after 1 iteration, it reaches the local maximum, and the gradient is exactly 0). Prove (by deriving the exact values for the descent trajectory) that this is true. You do not need to implement this in code (and, in fact, due to finite-precision floating-point arithmetic, it might not actually converge as intended). Submission: Create a Zip file containing both your Python and PDF files, and then submit on Canvas. If you are working as part of a group, then only one member of your group should submit (but make sure you have already signed up in a pre-allocated team for the homework on Canvas).
For this assignment, you ONLY need to answer either Q1 or Q2

Q1 In this problem, we will practice building CART models with a continuous outcome, using the dataset StateData.csv, which has data from the 1970s on all fifty US states. A description of the variables in the dataset is given in Table 1.

Variable         Description
Population       Population estimate of the state in 1975.
Income           Per capita income in the state in 1974.
Illiteracy       Illiteracy rates in 1970, as a percentage of the state’s population.
LifeExp          The life expectancy in years of residents of the state in 1970.
Murder           The murder and non-negligent manslaughter rate per 100,000 population in 1976.
HighSchoolGrad   The high-school graduation rate in the state in 1970.
Frost            The mean number of days with minimum temperature below freezing from 1931 to 1960 in the capital or a large city of the state.
Area             The land area (in square miles) of the state.
Longitude        The longitude of the center of the state.
Latitude         The latitude of the center of the state.
Region           The region (Northeast, South, North Central, or West) that the state belongs to.
Table 1: Variables in the dataset StateData.csv.

(a) Let us start by building a linear regression model. Randomly split the dataset into a training set (70%) and a test set (30%).
(i) First, build a linear regression model to predict LifeExp using the following several variables as the independent variables: Population, Murder, Frost, Income, Illiteracy, Area, and HighSchoolGrad. Use the training dataset to build the model. What is the R² of the model on the test set?
(ii) Now, build a linear regression model to predict LifeExp using the following four variables as the independent variables: Population, Murder, Frost, and HighSchoolGrad. Again, use the training dataset to build the model. What is the R² of the model on the test set?
(iii) Compare these two models. What are we achieving by removing independent variables? What is the equivalent procedure in a CART model?
(b) Now, build a CART model to predict LifeExp using the following seven variables as the independent variables: Population, Murder, Frost, Income, Illiteracy, Area, and HighSchoolGrad. Set the parameter minbucket to be 5. Make sure that you are building a regression tree, and not a classification tree, by setting the argument method to “anova” instead of “class”.
(i) Plot the trees. Which of the independent variables appear in the tree? Do you find the linear regression model or the CART model easier to interpret?
(ii) Compute the predicted life expectancies for the test dataset using the CART model, and calculate the R² of the predictions.
(c) Now, build a random forest model to predict LifeExp using the same seven variables as the independent variables. Set the parameter nodesize to 5. Compute the predicted life expectancies for the test dataset using the random forest model, and calculate the R² of the predictions.
(d) Which of the four models you built do you think is the best model, if out-of-sample accuracy is the most important? How about if interpretability is the most important?

Q2 Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns thousands of results for a simple query. For example, if you type the search term “jaguar” into Google, over 400 million results are returned. This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings, like this one.
If we search for “jaguar”, we might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. The two most common clustering algorithms used for document clustering are Hierarchical and K-means.In this problem, we will be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. The file DailyKos.csv contains data on 3,430 news articles or blogs that have been posted on Daily Kos.These articles were posted in 2004, leading up to the United States Presidential Election. The leading candidates were incumbent President George W. Bush (Republican candidate) and Senator John Kerry (Democratic candidate). Foreign policy was a dominant topic of the election, specifically, the 2003 invasion of Iraq.There are 1,545 variables in this dataset. Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). For each document, or observation, the variable values are the number of times that word appeared in the document. (This approach is called bag of words in text analytics.)(a) Start by building a Hierarchical Clustering model to cluster documents using all of the variables in the dataset. Indicate which distance metrics you used for distances between the observations and distances between the clusters.(i) Building a hierarchical clustering model will probably take a significant amount of time on this dataset. Why?(ii) Plot the dendrogram of your hierarchical clustering model. Using the dendrogram and thinking about this particular application, which number of clusters would you recommend? Keep in mind that document clustering would most likely be used by Daily Kos to show readers categories to choose from when trying to decide which articles to read.(iii) Assign each observation to a cluster, using the number of clusters you recommended in the previous subproblem. How many observations are in each cluster?(iv) Split your dataset into a dataset for each cluster, using your cluster assignments. Then, find the six most frequent words in each cluster. If you name the first HC1 in R, this can be done with the command: tail(sort(colMeans(HC1))).Describe each cluster. Is there a cluster that is mostly about the Iraq war? Is there a cluster that is mostly about the democratic party? It might be helpful to know that in 2004, Howard Dean was one of the candidates for the Democratic nomination for the President of the United States, John Kerry was the candidate who won the democratic nomination, and John Edwards was the running mate of John Kerry (the Democratic Vice President nominee).(b) Now cluster the documents using K-means clustering. Choose the same number of clusters that you recommended for Hierarchical clustering.(i) How many observations are in each cluster? Is your answer the same as it was with Hierarchical clustering? Why or why not?(ii) Just like you did for Hierarchical clustering, split your dataset into a dataset for each K-means cluster, and analyze the most frequent words in each cluster. Are the clusters similar to the Hierarchical clusters?Can you find a similar Hierarchical cluster for each K-means cluster? Keep in mind that the order of the clusters (which cluster is labeled as 1, which cluster is labeled as 2, etc.) 
is meaningless — for example, Hierarchical cluster 3 might be very similar to K-means cluster 1.
In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders would like to minimize the risk of a borrower being unable to repay a loan.

In this exercise, we will use publicly available data from LendingClub, a website that connects borrowers and investors over the internet. The dataset is in the file Loans.csv. There are 9,578 observations, each representing a 3-year loan that was funded through the LendingClub.com platform between May 2007 and February 2010. There are 18 variables in the dataset, described in Table 1. We will be trying to predict NotFullyPaid, using all of the other variables as independent variables.

(a) Let us start by building a logistic regression model.
(i) First, randomly split the dataset Loans.csv into a training set and a testing set. Put 70% of the data in the training set. What is the accuracy on the test set of a simple baseline model that predicts that all loans will be paid back in full (NotFullyPaid = 0)? Our goal will be to build a model that adds value over this simple baseline method.
(ii) Now, build a logistic regression model that predicts the dependent variable NotFullyPaid using all of the other variables as independent variables. Use the training set as the data for the model. Describe your resulting model. Which of the independent variables are significant in your model?
(iii) Consider two loan applications, which are identical other than the fact that the borrower in Application A has a FICO credit score of 700 while the borrower in Application B has a FICO credit score of 710. Let Logit(A) be the value of the linear logit function of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. What is the value of Logit(A) minus Logit(B)? (A short worked identity for this calculation is given after Table 1 below.)
(iv) Now predict the probability of the test set loans not being paid back in full. What is the accuracy of the logistic regression model on the test set using a threshold of 0.5? How does this compare to the baseline model?
(b) LendingClub assigns the interest rate to a loan based on their estimate of that loan’s risk. This variable, IntRate, is an independent variable in our dataset. In this part, we will investigate just using the loan’s interest rate as a “Smart baseline” to order the loans according to risk.
(i) Using the training set, build a logistic regression model that predicts the dependent variable NotFullyPaid using IntRate as the only independent variable. Is IntRate significant in this model? Was it significant in the first logistic regression model you built? How would you explain this difference?
(ii) Use the model you just built (with only one independent variable) to make predictions for the observations in the test set. What is the highest predicted probability of a loan not being paid back in full on the test set? How many loans would we predict would not be paid back in full if we used a threshold of 0.5 to make predictions?

Variable        Description
CreditPolicy    1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
Purpose.CC      1 if the purpose of the loan is related to Credit Card, and 0 otherwise.
Purpose.DC      1 if the purpose of the loan is related to Debt Consolidation, and 0 otherwise.
Purpose.Edu     1 if the purpose of the loan is related to Education, and 0 otherwise.
Purpose.MP      1 if the purpose of the loan is related to Major Purchase, and 0 otherwise.
Purpose.SB      1 if the purpose of the loan is related to Small Business, and 0 otherwise.
IntRate         The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Note that borrowers judged by LendingClub to be more risky are assigned higher interest rates.
Installment     The monthly installments ($) owed by the borrower if the loan is funded.
LogAnnualInc    The natural log of the self-reported annual income of the borrower.
Dti             The debt-to-income ratio of the borrower (amount of debt divided by annual income).
Fico            The FICO credit score of the borrower.
DaysWithCrLine  The number of days the borrower has had a credit line.
RevolBal        The borrower’s revolving balance (amount unpaid at the end of the credit billing cycle).
RevolUtil       The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
InqLast6mths    The borrower’s number of inquiries by creditors in the last 6 months.
Delinq2yrs      The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
PubRec          The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgements).
NotFullyPaid    1 if the loan was not paid back in full, and 0 otherwise.
Table 1: Variables in the dataset Loans.csv.
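Worked identity for part (a)(iii) (a sketch to check your arithmetic, not a required derivation): write β_Fico for the fitted coefficient on Fico in your model (its numerical value will depend on your training split). The two applications have identical values for every other variable, so all other terms cancel:

Logit(A) − Logit(B) = (β₀ + · · · + β_Fico · 700 + · · ·) − (β₀ + · · · + β_Fico · 710 + · · ·) = −10 · β_Fico.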
COMP3027 Algorithm Design
Assignment 3, Semester 1 2025

This assignment is due on May 7 and should be submitted on Gradescope. All submitted work must be done individually without consulting someone else's solutions, in accordance with the University's "Academic Integrity" policies. As a first step, go to the last page and read the section "Advice on how to do the assignment."

Problem 1. (100 points)

It is the year 3241 LY. You are the CEO of Theed Palace Space Vessel Engineering Corps (known as Theed Hangars for short), a starship manufacturing company based in the city of Theed on the planet Naboo. Among your esteemed range of products is the iconic N-1 starfighter, a single-seat patrol fighter used by the Royal Naboo Security Forces (RNSF).

One thing many denizens of Naboo might not understand is that while Theed Hangars designs and carries out the finishing touches and quality assurance passes on all of its vessels, the bulk of the actual manufacturing work is outsourced to a vast supply chain of factories scattered across the Free Trade Zone. Theed Hangars has sought to capitalize upon this era of peace and prosperity in the galaxy by maximally leveraging comparative advantage, establishing a delicate web of efficient manufacturing hubs across a multitude of star systems over the course of many decades.

To spare you the details, the three major elements of the N-1 starfighter construction pipeline are as follows:

1. The machining of the elegant N-1 spaceframe;
2. The attachment of the twin J-type engines, as well as the Monarc C-4 hyperdrive engine;
3. The fitout of the starfighter's flight computer, sensor suite, armament, and life support systems.

Each of the major factories in the N-1 starfighter supply chain carries out one of these three steps in its entirety. It is important that each starfighter under construction undergo each of these steps in the preceding order, since each step depends crucially on the completion of the prior one (e.g. the calibration of the starfighter's flight computer and control systems requires that the engines already be in place).

Unfortunately, not all is well in the Free Trade Zone: owing to the growing unrest with the excesses of the Trade Federation, many star systems have started moving towards imposing tariffs on inter-system trade. For a supply chain as delicate and fragmented as that of the typical Theed Hangars vessel, the impact of tariffs would be disastrous.

As CEO of Theed Hangars, you now find yourself scrambling from system to system, hoping to influence senators and other officials and obtain exemptions from their new and often harsh tariff regimes. Across many schmoozy meetings, a common theme has emerged: quotas. If you can keep inter-system inputs into the factories of a given system within a given quota, then the governing parties of the respective star systems will be able to sell a narrative to their constituents of bringing back local manufacturing to their system, and thus will exempt Theed Hangars from the brunt of the tariffs.

While all of this schmoozing in the halls of power is well and good, you also have some pressing issues to address within the supply chain itself: most of the manufacturing hubs were established as joint ventures with their respective star systems' local inhabitants or with foreign investment groups. To keep a given factory open, it must remain profitable, which means it must produce a minimum output each year.
On the other hand, the RNSF has increased its order of record for N-1 starfighter procurement quite substantially this year, to a lofty total of X spaceframes. As you continue to hop between star systems to lobby for exemptions, you at the very least need a way to check: given the exemptions you have currently managed to win, is it possible to meet the N-1 order of record while respecting the agreed-to quotas and keeping all factories profitable?

Formally, you are given the following:

• A non-negative integer order of record X for N-1 starfighters.
• A set of star systems numbered from 1 to m.
• For each pair of star systems i, j, and each T ∈ {frame, engines}, you are given a maximum allowable inter-system quota: a non-negative integer q_{i,j,T} indicating the maximum amount of goods of type T which are allowed to be imported from system i to system j.
• A set of n factories. For each factory i, you are given:
  – Its type, T_i ∈ {frame, engines, fitout}.
  – Its location, a star system L_i ∈ {1, ..., m}.
  – Its required output to remain profitable, a non-negative integer a_i.
  – Its maximum manufacturing capacity, a non-negative integer b_i.

Your task is to determine whether it is possible to assemble a supply chain schedule (i.e. an assignment of spaceframes to frame, engines, and fitout factories) for X starfighters, such that each factory has capacity to complete work on the spaceframes assigned to it, all inter-system quotas are respected, and all factories remain open.

a) Describe the algorithm. [30 points]
b) Prove its correctness. [50 points]
c) Establish its time complexity. [20 points]

Advice on how to do the assignment

• Assignments should be typed and submitted as PDF (no PDF containing text as images, no handwriting).
• Start by typing your student ID at the top of the first page of your submission. Do not type your name.
• Submit only your answers to the questions. Do not copy the questions.
• You may use paragraphs, lists or itemized points, pseudocode, or any combination of the above to organize your algorithm description. Do not include actual code in your submission. (Penalties will apply!)
• When designing an algorithm or data structure, it might help you (and us) if you give a brief (1-2 sentence) description of the overall idea, and after that develop and elaborate on the details. If your algorithm is stated in pseudocode or the details of the algorithm are complex, then a brief description of the overall idea is required. Remember that an assignment in this course is an exercise in effectively presenting and communicating technical ideas (i.e. not just "solving the problem"), and will be graded accordingly.
• Refrain from explaining "why" your algorithm does what it does in the algorithm description itself. The "why" belongs in your proof. Your algorithm description should be a lean, complete, and unambiguous statement of just "what" your algorithm is.
• If you use pseudocode in your algorithm description, avoid adhering to any particular pseudocode "standard". Each line should endeavour to be "prosaic" in describing the intended step, with mathematical notation interspersed to denote any objects or concepts which you feel cannot be conveyed clearly in prose. Treat the pseudocode in Kleinberg and Tardos as exemplary.
• In general, try to be prosaic rather than syntactic when describing your algorithm. Treat this as an advantage! You do not need to describe your algorithm to a compiler or interpreter, so you can focus on phrasing your description for ease of readability by a human reader.
• Be careful with giving multiple or alternative answers. If you give multiple answers, then we will give you marks only for "your worst answer", as this indicates how well you understood the question.
• You can use the material presented in the lectures or textbooks without proving it. You do not need to write more than necessary (see comment above).
• Unless otherwise stated, we always ask about worst-case analysis, worst-case running times, etc.
• If you use further resources (books, scientific papers, the internet, ...) to formulate your answers, then add references to your sources and explain them in your own words. Only citing a source doesn't show your understanding and will thus get you very few (if any) marks. Copying from any source without reference is considered plagiarism.
MATH3014-6027 Design of Experiments
SEMESTER 2 EXAMINATION 2021/22

1. [25 marks] Table 1 shows the results of an experiment conducted to study the influence of three different types of glass on the light output of an oscilloscope tube.

Table 1: Oscilloscope experiment: light output (lumens) from three different glass types.

Glass 1   Glass 2   Glass 3
   580      1550       546
  1090      1070      1045
  1392      1328       867
   568      1530       575
  1087      1035      1053
  1380      1312       904
   570      1579       599
  1085      1000      1066
  1386      1299       889

The data could be read into R using the following code.

oscil
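A minimal sketch of one way to complete this read-in, assuming Table 1 is entered row by row into a data frame named oscil with the light output as the response and glass type as a factor (all names other than oscil are illustrative assumptions):

```r
# Enter the Table 1 data row by row (each row contributes one value per glass type).
light <- c( 580, 1550,  546,
           1090, 1070, 1045,
           1392, 1328,  867,
            568, 1530,  575,
           1087, 1035, 1053,
           1380, 1312,  904,
            570, 1579,  599,
           1085, 1000, 1066,
           1386, 1299,  889)
glass <- factor(rep(c("Glass 1", "Glass 2", "Glass 3"), times = 9))

oscil <- data.frame(light = light, glass = glass)

# One natural first analysis: a one-way analysis of variance comparing the glass types.
oscil.aov <- aov(light ~ glass, data = oscil)
summary(oscil.aov)
```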
2nd SEMESTER 2024/25 Group Assignment
ECO310 - Econometrics of Time Series

This assessment takes the form of a group research project (7-8 students in each group, except for one group), with group membership randomly assigned. The coursework project is part of the assessment for the ECO310 module, with a weight of 15% of the final mark for the module.

Project Task

Each project group is randomly assigned time series for the prices of 3 stocks sourced from DataStream, containing closing prices for the period December 30, 2016 to December 30, 2022. Using appropriate time series econometric techniques, please address these questions in sequential order in your report. You are expected to present your report as if you are a team of investment analysts advising your clients. (An illustrative R sketch of the core forecasting steps in questions 1-4 appears after the assessment criteria at the end of this brief.)

1. Evaluate and comment on the trading performance of the three stocks during this period. You should use log-returns.

2. Does any of the series display serial correlation and a unit root? If so, what does this mean to a retail investor who has zero knowledge of time series analysis? You should attempt to explain these concepts as intuitively as possible.

3. What is your forecast of the prices of all three stocks in the first three trading days of the year 2023? What are your expected returns if you hold an equally weighted portfolio of the 3 stocks? Hint: For each of the three series, develop a suitable ARIMA model and then use these to implement one-, two-, and three-step-ahead out-of-sample forecasts.

4. Further, you should also employ an appropriate forecast combination technique to develop a combined forecast model for forecasting the returns of each of the 3 stocks. Argue for the best forecasting model for each of the 3 stocks.

5. Finally, which stock is likely to be the main driver of performance among the three stocks you are assigned? Explain your reasoning. Hint: You may find formal Granger causality tests, variance decomposition and impulse response function analyses to be useful.

If further analyses are deemed relevant and can strengthen your arguments, you can add more information that may be peripheral but in support of your arguments.

Submission and deadline

Note that the group mark is the same for all group members. However, if the majority of students within a group are in consensus that a particular student has not participated and contributed to the group project, he/she will receive a 50% penalty on the group's project mark.

Each group should submit a report of no less than 1,000 but no more than 1,200 words (excluding title page and Appendix), through LMO no later than 23:00, May 9 2025. In each group, one group member submits the assignment on behalf of all the group members on the LMO. Detailed student names and numbers of all group members should therefore be included on the front title page of the submitted report.

Late submissions policy: Late submissions will be penalized 10 marks for every working day past the deadline. Late submissions will be accepted up to 5 working days after the deadline.

Backup: If for some reason submission through LMO fails, students can send their coursework to the module leader via e-mail: [email protected]

Assessment

The final mark will be based on the evaluation of the submitted report according to the following criteria (percentages of total marks in parentheses):

(i) Data & preliminary analysis (15 percent): Your report should have a clearly explained dataset with the indicated time period.
The selection of the dataset and specific details such as the handling of missing observations, as well as the preliminary graphical analysis, should be included in the report.

(ii) Methodology (15 percent): The econometric methodology applied in every step should be correct, and explained clearly and sufficiently. Specifically, the rationale for applying each methodology should be justified given the empirical features observed in the dataset.

(iii) Written report quality, applications, and presentation of results (50 percent): Overall, the quality of the written report and the clarity of presentation of the results are very important criteria in the marking. The flow of the analyses implemented should be coherent, easy to follow, and make econometric sense. Any mistakes in the interpretation of results will be penalized accordingly. Note that proper use of English grammar and vocabulary is important.

(iv) Literature and References (10 percent): The literature referred to should be relevant, discussed, and properly cited. Referencing must consistently follow the Harvard referencing style. The studies that are utilized should be briefly discussed and referenced.

(v) Technical appendix (10 percent): All E-Views output (or output from other computational tools deemed appropriate) used to guide the empirical analyses must be included in the Appendix. These outputs should be clearly labelled and ordered according to their use in the project report.
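For concreteness, the following is a minimal sketch, not a required workflow, of the core steps behind questions 1-4 for a single price series. It is written in R rather than E-Views purely for illustration; the vector name prices and the use of the tseries and forecast packages are assumptions, and any equivalent tool is acceptable.

```r
# Illustrative sketch for one series; 'prices' is assumed to be a numeric
# vector of daily closing prices ending on December 30, 2022.
library(tseries)   # adf.test (unit root test)
library(forecast)  # auto.arima, Arima, forecast

# Q1: log-returns.
log_ret <- diff(log(prices))

# Q2: serial correlation and unit root checks.
Box.test(log_ret, lag = 10, type = "Ljung-Box")  # serial correlation in returns
adf.test(log(prices))                            # unit root in log prices

# Q3: ARIMA model on log prices and one-, two-, three-step-ahead forecasts.
fit <- auto.arima(log(prices))
fc  <- forecast(fit, h = 3)
exp(fc$mean)   # back-transform the forecasts to the price level

# Q4 (idea): a simple forecast combination, here the average of two candidate
# models' forecasts, which can then be compared against each individual model.
fit2 <- Arima(log(prices), order = c(1, 1, 1))
fc2  <- forecast(fit2, h = 3)
exp((fc$mean + fc2$mean) / 2)
```

For question 5, pairwise Granger causality between the return series can be examined with, for example, lmtest::grangertest, alongside the variance decomposition and impulse response analyses suggested in the hint.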