Purpose

The purpose of this assignment is to become familiar with PyTorch and to build a neural network with PyTorch for image classification on the Fashion MNIST dataset. You will build a Convolutional Neural Network (CNN), the most popular approach for extracting features from an image without excessive computation (compared to a fully connected layer).

Objectives

Learners will be able to:
● Gain a fundamental understanding of PyTorch, including its key concepts and functionalities
● Learn the architecture and principles of CNN models
● Build a CNN using PyTorch to efficiently extract features from the Fashion MNIST dataset

Technology Requirements

● GPU environment
● Jupyter Notebook
● Python 3 (Python 3.8 and above)
● PyTorch
● Torchvision
● NumPy
● Matplotlib

Directions

Accessing zyLabs

You will complete and submit your work through zyBooks' zyLabs. Follow the directions to correctly access the provided workspace:
1. Go to the Canvas Assignment, "Submission: Convolutional Neural Networks with PyTorch Assignment"
2. Click the "Load Submission…in new window" button.
3. Once in zyLabs, click the green button in the Jupyter Notebook to get started.
4. Review the directions and resources provided in the description.
5. When ready, review the provided code and develop your work where instructed.

Assignment Directions

Architecture

We implement a Convolutional Neural Network as our model, using the following layers:
● a convolution layer for extracting features
● a batch normalization layer for normalizing the weights in the hidden layers
● a ReLU activation function for the non-linearity between layers
● fully connected layers at the end

Model

We use the following convolutional neural network architecture for our dataset.
● convolution layer: output_channels=16, kernel_size=3, stride=1, padding=1
● batch normalization layer
● ReLU activation layer
● max pool layer: kernel_size=2, stride=2
● convolution layer: output_channels=32, kernel_size=3, stride=1, padding=1
● batch normalization layer
● ReLU activation layer
● max pool layer: kernel_size=2, stride=2
● convolution layer: output_channels=64, kernel_size=5, stride=1, padding=2
● batch normalization layer
● ReLU activation layer
● max pool layer: kernel_size=2, stride=2
● fully connected layer: number_of_classes outputs

Required Tasks

Build the Model

● First, define a class called Model inheriting from PyTorch's nn.Module.
● In __init__ (the constructor), define all the layers used to build the model.
● Define a forward function for a sequential model that takes images as input and returns the predictions as output.

All the functions are available in the PyTorch package. Read the documentation/source code for a better understanding:
● Convolutional layer: https://pytorch.org/docs/stable/nn.html#convolution-layers
● Batchnorm layer: https://pytorch.org/docs/stable/nn.html#normalization-layers
● Activation ReLU: https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity
● Max pooling layer: https://pytorch.org/docs/stable/nn.html#pooling-layers
● Fully connected layer: https://pytorch.org/docs/stable/nn.html#linear-layers

Initialize the CNN Model

Define a loss criterion. In this assignment, use the cross-entropy loss between the predictions and the ground truth to estimate the loss.
● CrossEntropyLoss – https://pytorch.org/docs/stable/nn.html#crossentropyloss

Define an optimization strategy to update the weights. In this assignment, we use the widely used Adam optimizer from the PyTorch package.
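The architecture and initialization described above can be sketched as follows. This is a minimal illustration only, assuming 1×28×28 Fashion MNIST inputs and 10 classes; the layer grouping and the train_step helper are illustrative choices, not the provided notebook code:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """CNN matching the layer list above (assumes 1x28x28 inputs)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 14x14 -> 7x7
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 7x7 -> 3x3
        )
        self.fc = nn.Linear(64 * 3 * 3, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)            # keep the batch dimension
        return self.fc(x)

model = Model()
criterion = nn.CrossEntropyLoss()                 # loss criterion
optimizer = torch.optim.Adam(model.parameters())  # optimization strategy

def train_step(images, labels):
    """One pass of the training loop described in this handout."""
    optimizer.zero_grad()             # clear the gradients
    output = model(images)            # extract output
    loss = criterion(output, labels)  # batch loss vs. ground truth
    loss.backward()                   # backpropagate
    optimizer.step()                  # update the parameters
    return loss.item()                # accumulate for plots
```

On a GPU machine you would first move the model, images, and labels to the device (for example, model.to("cuda")) as described in the training loop section of this handout.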
● Adam – https://pytorch.org/docs/stable/optim.html#algorithms

Training the Model

The training loop is set up in the following way. For every batch in the defined number of epochs:
● Move the images and labels to the GPU (after checking availability with is_cuda)
● Extract the output by passing the images through the model
● Pass the output and the ground truth to the loss criterion to get the batch loss
● Clear the gradients
● Backpropagate (compute gradients w.r.t. the parameters) using backward()
● Update the parameters with a single optimization step
● Update the training loss for plots
● Repeat

Testing the Classification

In the testing loop, we don't update the weights. The trained model is tested on all the samples in the test data to compute the accuracy and observe how well the model generalizes to unseen data. The testing loop is set up in the following way. For every batch in the testing data:
● Put the model in evaluation mode and turn off the gradients
● Move the images and labels to the available device
● Extract the output from the model for the input
● Compute the predicted class by choosing the one with the maximum probability in the predictions
● Compare the predicted classes with the true classes
● Calculate the accuracy
● Update test_loss for plots
● Repeat

Submission Directions for Project Deliverables

Learners are expected to work on the assignment individually. Ideas and concepts may be discussed with peers, and other sources can be referenced for assistance, but the submitted work must be entirely your own. You must complete and submit your work through zyBooks' zyLabs to receive credit for the assignment:
1. To get started, use the provided Jupyter Notebook in your workspace.
2. All necessary datasets are already loaded into the workspace.
3. Execute your code by clicking the "Run" button in the top menu bar.
4. When you are ready to submit your completed work, click "Submit for grading" at the bottom left of the notebook.
5.
You will know you have completed the assignment when feedback appears below the notebook.
6. If needed, to resubmit the assignment in zyLabs:
   a. Edit your work in the provided workspace.
   b. Run your code again.
   c. Click "Submit for grading" again at the bottom of the screen.

Your submission score will automatically be populated from zyBooks into your course grade. However, the course team will review submissions after the due date has passed to ensure grades are accurate.

Evaluation

This assignment is auto-graded. There are a total of four (4) test cases, each with points assigned to it. Please review the notebook to see the points assigned for each test case. A percentage score will be passed to Canvas based on your score.
Purpose

The goal of this project is to explore and analyze code coverage using structural-based testing techniques.

Objectives

Learners will be able to:
● Analyze code coverage using statement and decision coverage techniques.
● Develop a set of test cases based on specified requirements.
● Learn about different data flow anomalies.

Technology Requirements

● Java 17 or above

Project Description

In this assignment, you will focus on two vital aspects of software testing and code analysis. The project consists of two distinct parts, each geared toward honing your skills in specific areas.

In Part 1, you will explore code coverage analysis for a Vending Machine application. You will learn how to assess code coverage using statement and decision coverage techniques, enabling you to identify untested areas and enhance the test suite's efficiency.

In Part 2, the focus shifts to data flow analysis through static analysis. You will develop the ability to detect data flow anomalies, such as data leaks and uninitialized variables.

Directions

Download the provided files located on the assignment page, then carefully review the directions before starting your work:
● VendingMachine.java
● StaticAnalysis.java

Part 1:

You are expected to research and experiment with a tool that provides statement and decision code coverage. You are given a Java file called VendingMachine.java and are asked to develop a set of test cases for this code based on the following requirements:
● Takes in an integer input
● Allows users to select between three products: Candy (20 cents), Coke (25 cents), Coffee (45 cents)
● Returns the selected product and any remaining change
● If there is not enough money to buy the product, displays the amount necessary to buy the product and other products to purchase

You are expected to execute the program with your test cases, then observe and report the code coverage your test cases achieve.
The goal is to reach 100% statement coverage and 90% decision coverage. You should keep changing your test cases until you reach the desired coverage.

Part 2:

You are expected to research and experiment with a static source code analysis tool. You are given a Java file called StaticAnalysis.java which contains two different data flow anomalies. The inputs to this code are:
● the weight of the package, as an integer
● the length of the package, as an integer
● the type of product, as a String

You are expected to execute the tool you selected and analyze the report it generates for the StaticAnalysis.java code. You need to analyze the tool's findings on this code and comment on how well the tool has performed in identifying the two built-in anomalies. Finally, you are expected to assess the tool in your own words in terms of:
● Features and functionalities provided by the tool
● Type of anomalies covered by the tool
● Ease of use

CSE 565: Software Verification and Validation – Structural-Based Testing Assignment

Submission Directions for Project Deliverables

You will submit all project deliverables compiled into one (1) report in PDF format. Title your file yourlastname_firstname_CSE 565_StructuralBasedTesting and submit it at the assignment submission page. A single report in APA/MLA style should include Part 1 and Part 2, consisting of all the screenshots of test cases and coverage, with detailed explanations of what the screenshots represent.

Part 1

It should include:
1. A description of the tool used and the type of code coverage it provides
2. A description of the set of test cases developed, including screenshots of the code written
3. A report and discussion of the coverage achieved by the executed test cases, including screenshots showing the tool's coverage

Part 2

It should include:
1. A description of the tool used and the types of analysis it provides
2. A description and analysis of the anomalies detected by the tool, with a screenshot of the report generated
3.
Your assessment of the tool in your own words in terms of:
● Features and functionalities provided by the tool
● Type of anomalies covered by the tool
● Ease of use

Submission Guidelines

You may submit your deliverables as many times as needed. However, only the most recent submission will be graded. You must submit your Project 2 file in the designated submission space. Learners may not email or use other means to submit any project for review, including feedback and grading.

Evaluation

You will be evaluated on the criteria for each part (worth 100 points total):

Part 1
1. Description of the tool and type of coverage (10 points) – Description of the tool used and the types of coverage it provides
2. Development and description of a set of test cases (20 points) – Test cases covering all code coverage requirements. You are expected to include a screenshot showing the test cases developed for achieving the required coverage.
3. Reporting and discussion of the code coverage (20 points) – You are expected to include a report and discuss the coverage achieved by the executed test cases, including screenshots showing the code coverage achieved.

Part 2
1. Description of the tool and type of anomalies covered (10 points) – Description of the tool used and the types of anomalies it covers
2. Description and analysis of the anomalies detected by the tool (20 points) – Description and analysis of the anomalies detected by the tool, with a screenshot of the report generated
3. Assessment of the tool (20 points) – You are expected to evaluate the tool in terms of:
● Features and functionalities
● Coverage provided
● Ease of use
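To make the statement vs. decision coverage distinction in Part 1 concrete, here is a small Python analogue of the vending machine spec. The provided VendingMachine.java must still be tested in Java with a Java coverage tool; this sketch only restates the requirements listed above, and its function and message wording are illustrative:

```python
PRICES = {"Candy": 20, "Coke": 25, "Coffee": 45}

def vend(money, product):
    """Return (product, change), or a message saying how much more is needed."""
    if product not in PRICES:            # decision 1
        return "unknown product"
    price = PRICES[product]
    if money >= price:                   # decision 2
        return (product, money - price)  # selected product plus remaining change
    affordable = [p for p, c in PRICES.items() if c <= money]
    return ("insufficient: need %d more cents; you can afford %s"
            % (price - money, affordable))

# Decision coverage requires each decision to evaluate both True and False:
cases = [
    (50, "Coffee"),   # decision 1 False, decision 2 True
    (10, "Coffee"),   # decision 1 False, decision 2 False
    (25, "Soda"),     # decision 1 True
]
```

Running those three inputs executes every statement and drives every decision both ways; a coverage tool reports this as 100% statement and 100% decision coverage for the function, which is the kind of evidence Part 1 asks you to screenshot.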
Purpose

Through this project, you will demonstrate your understanding of DOE principles, your ability to utilize appropriate tools, your proficiency in generating effective pairwise combination test cases for a mobile application, and your ability to critically analyze the effectiveness of using an AI-powered language model in DOE software testing.

Objectives

Learners will be able to:
● Apply design of experiments to develop tests
● Research and identify a suitable tool for DOE testing
● Develop pairwise combination test cases using two (2) different approaches: a DOE tool and an AI-based tool (e.g. ChatGPT)
● Analyze the effectiveness of AI-powered language models for testing in comparison with traditional testing methods

Technology Requirements

● ChatGPT
● DOE tool

Project Description

This assignment is focused on deriving test cases using an advanced specification-based testing technique called Design of Experiments (DOE), which takes a systematic approach, employing statistical methods to design test cases and comprehensively analyze the effects of multiple input factors on the system's behavior.

Directions

You are asked to develop two (2) different sets of test cases using the pairwise combination Design of Experiments (DOE) technique, employing generative AI and a DOE tool of your choice. Below is the specification of the application that you should use when creating the test cases.

Assume that your team has created a new mobile application which requires interaction with a customer to collect user input. The application requires a customer to enter five (5) different types of user input, each with two to five different options.
Below is the table showing the specification of the mobile application:

Type of Phone     Authentication     Connectivity   Memory   Battery Level
iPhone 14         Fingerprint        Wireless       128 GB   < 20%
iPhone 13         Face recognition   3G             256 GB   20-39%
Galaxy Z          Text Password      4G LTE         512 GB   40-59%
Huawei Mate                          5G             1 TB     60-79%
Google Pixel 7                       Edge                    80-100%

Task 1

You are asked to develop a set of test cases using the pairwise combination DOE technique with a DOE tool. You are expected to research and identify a DOE tool that you will use to create pairwise DOE test cases for testing the application specified above.

CSE 565: Software Verification and Validation – Design of Experiments Assignment

Task 2

You are asked to develop a set of test cases using the pairwise combination DOE technique with a generative AI tool. You are expected to research and identify a generative AI tool (e.g. ChatGPT) that you will use to create pairwise DOE test cases for testing the mobile application.

Task 3

You are asked to compare and contrast the two sets of test cases created in Task 1 and Task 2 and analyze how well they perform in creating test cases to validate the mobile application.

Task 4

You are asked to assess the effectiveness of the DOE tool in creating test cases using the pairwise combination DOE technique. Your assessment should explain your experience with the tool and discuss its performance in terms of the following topics:
● Features and functionalities of the tool
● Scope covered by the tool
● Performance of the tool
● Ease of use

Task 5

You are asked to assess the effectiveness of the generative AI tool in creating test cases using the pairwise combination DOE technique.
You are asked to:
● Describe your experience with the AI tool in terms of creating prompts and processing the results generated by the prompts
● Discuss the significance of using generative AI in software testing with the DOE technique

Submission Directions for Project Deliverables

This assignment requires submission of one (1) deliverable: a report in PDF format including:
1. Explanation of the test cases by the DOE tool – The report should describe the test cases created by the DOE tool, with evidence (i.e. screenshots) showing the test cases created by the tool.
2. Explanation of the test cases by the generative AI tool – The report should describe the test cases created by the generative AI tool, the prompt you entered when interacting with the AI tool, and the result generated by the AI tool. Include screenshots of the prompts and results to further explain.
3. An assessment of the test cases – The report should include an assessment of the two sets of test cases developed by the DOE tool and the AI tool in terms of their validity, and discuss how well the test cases align with the guidelines of the DOE technique.
4. An assessment of the DOE tool – In your own words, assess the effectiveness of the DOE tool in creating test cases using the pairwise combination DOE technique. You are expected to explain your experience with the tool and discuss its performance in terms of:
   ○ Features and functionalities
   ○ Scope covered by the tool
   ○ Performance of the tool
   ○ Ease of use
5. An assessment of the generative AI tool – In your own words, assess the effectiveness of the AI tool that you used in creating test cases using the pairwise combination DOE technique. On the basis of your findings, you are expected to explain your experience with the tool and discuss the significance of using generative AI in software testing with the DOE technique.
Different DOE tools and generative AI tools can give different results. You need to evaluate them, see which one is better for the given task, and use that tool.

When ready, title your file yourlastname_firstname_CSE 565_DOEAssignment and submit it at the assignment submission page. You are expected to submit:
● a file in PDF format

Submission Guidelines

You may submit your deliverables as many times as needed. However, only the most recent submission will be graded. You must submit your assignment file in the designated submission space. Learners may not email or use other means to submit any project for review, including feedback and grading.

Evaluation

Your assignment will be evaluated based on the criteria below (worth 100 points total):
1. Explanation of the test cases by the DOE tool (10 points)
2. Explanation of the test cases by the generative AI tool (10 points)
3. An assessment of the test cases (20 + 20 points)
4. An assessment of the DOE tool (20 points)
5. An assessment of the generative AI tool (20 points)
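For intuition about what a pairwise DOE tool produces, here is a toy greedy pairwise generator over the factors from the specification table above. It only illustrates the pairwise idea; real DOE tools use stronger algorithms, and my reading of the blank table cells (which factors have fewer options) is an assumption:

```python
from itertools import combinations, product

# Factor options, as read from the specification table
FACTORS = [
    ["iPhone 14", "iPhone 13", "Galaxy Z", "Huawei Mate", "Google Pixel 7"],
    ["Fingerprint", "Face recognition", "Text Password"],
    ["Wireless", "3G", "4G LTE", "5G", "Edge"],
    ["128 GB", "256 GB", "512 GB", "1 TB"],
    ["< 20%", "20-39%", "40-59%", "60-79%", "80-100%"],
]

def pairs_of(case):
    """All (factor-index, value) pairs covered by one full test case."""
    return {((i, case[i]), (j, case[j]))
            for i, j in combinations(range(len(case)), 2)}

def greedy_pairwise(factors):
    """Pick full combinations until every value pair of every factor pair is hit."""
    uncovered = set()
    for i, j in combinations(range(len(factors)), 2):
        for vi, vj in product(factors[i], factors[j]):
            uncovered.add(((i, vi), (j, vj)))
    cases = []
    while uncovered:
        # choose the combination covering the most still-uncovered pairs
        best = max(product(*factors), key=lambda c: len(pairs_of(c) & uncovered))
        cases.append(best)
        uncovered -= pairs_of(best)
    return cases

cases = greedy_pairwise(FACTORS)
```

With these factors there are 5·3·5·4·5 = 1500 exhaustive combinations, but pairwise coverage needs only a few dozen test cases, which is exactly the economy the DOE technique is after.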
Purpose

This project will help you reinforce the topics discussed in class, including categories of testing, equivalence partitioning, boundary value, and cause-and-effect testing.

Objectives

Learners will be able to:
● Apply equivalence partitioning, boundary value, and cause-effect testing techniques.

Technology Requirements

● Java 17 or above

Project Description

In this project, you will gain hands-on experience in software testing methodologies. You will focus on three essential techniques:
● Equivalence Partitioning: Learn how to group inputs into classes and design test cases that cover each class efficiently.
● Boundary Value Analysis: Explore the edges of input ranges to uncover potential defects and create precise test cases.
● Cause and Effect Testing: Identify input conditions and their corresponding effects to build targeted test scenarios.

Throughout the project, you will apply these techniques to a provided software program, developing a set of well-structured test cases. By executing these tests, you will validate the program's functionality and ensure it meets the desired specifications. To complete this project, you will use equivalence partitioning, boundary value, and cause/effect analysis techniques to develop a set of test cases for the provided program and execute the tests.

Directions

Download the provided files located on the assignment description page, then carefully review the directions before starting your work:
● SpecificationBasedTesting_Test Case Spreadsheet.xlsx
● Jar Files and Project1Script.zip

In this assignment, we have developed a database to gather information about a product and its users and provide discounts depending on a variety of given factors. The jar file with the program is given to you, but it contains at least 10 seeded defects.
Using the equivalence partitioning, boundary value, and cause/effect analysis techniques discussed in lecture, please develop a set of test cases for this program and execute the tests following the instructions below.

Running Jar Files

There are three options to run your test cases for black box testing. Please make sure you have Java installed on your computer to run the jar files.

Using the GUI
● project1GUI.jar opens the GUI, where you can manually input your test cases and view the output.
● To run: open Command Prompt or Terminal and enter "java -jar project1GUI.jar", and the GUI will open.

Using Terminal or Command Prompt
● CSE565P1.jar contains the program that runs your test cases.
● The jar file takes in arguments corresponding to the inputs (User Name, Age, User Status, Rewards Member Status, Season Bought, Product Category, Rating of Product), e.g.: java -jar CSE565P1.jar Peter 24 Returning Silver Spring Electronics 5
   ○ The only values accepted for User Status, Rewards Member Status, Season Bought, and Product Category are below.
      ■ User Status: New or Returning
      ■ Rewards Member Status: Bronze, Silver or Gold
      ■ Season Bought: Winter, Spring, Summer or Fall
      ■ Product Category: Unknown or Electronics
   ○ To run: open Command Prompt or Terminal and enter "java -jar CSE565P1.jar Peter 24 Returning Silver Spring Electronics 5", and the output will be given.

CSE 565: Software Verification and Validation – Specification-Based Testing Assignment

Running tests automatically (Linux / Bash for Windows / Mac)
● CSE565P1.jar contains the program that runs your test cases.
● Project1Script is a bash script where you can input your test cases, and it will automatically run them.
● To add a test case and run:
   a. Open Project1Script in vim or another editor.
   b. Add in a test case, e.g.: java -jar CSE565P1.jar Peter 24 Returning Silver Spring Electronics 5
   c. Exit the editor.
   d. If needed, open Bash or Terminal and make Project1Script executable by running "chmod u+x Project1Script".
   e.
In Bash or Terminal, run "./Project1Script", and the output for each test case will be given.

Requirements

Input requirements
1. User Name (string)
   1.1. Can be 5-10 characters long
   1.2. Can only contain letters (no numbers)
   1.3. Cannot contain a hyphen (-) or underscore (_)
2. Age (integer)
   2.1. Must be 18 or over
3. User Status (do not test for invalids)
   3.1. New or Returning
4. Rewards Member Status (do not test for invalids)
   4.1. Bronze, Silver or Gold
5. Season Bought (do not test for invalids)
   5.1. Winter, Spring, Summer or Fall
6. Product Category (do not test for invalids)
   6.1. Unknown or Electronics
7. Rating of Product (integer)
   7.1. Rating must be between 1 and 10

Output requirements
8. Error messages for incorrect username, age, and rating.
9. Discounts given based on:
   9.1. New users cannot be given any discounts, regardless of reward status, product, and season
   9.2. If the product category is Unknown, the user gets no discount
   9.3. If bought in Summer:
      9.3.1. Returning users who are Gold members and bought Electronics get a 15% discount voucher
   9.4. If bought in Spring:
      9.4.1. Returning users who are Bronze members and bought Electronics get a 10% discount voucher
   9.5. If bought in Winter:
      9.5.1. Returning users who are Gold members and bought Electronics get a 25% discount voucher
   9.6. If bought in Fall:
      9.6.1. Returning users who are Silver members and bought Electronics get a 15% discount voucher
   9.7. Any other combination of returning user, reward status, product, and season does not get a discount.

Remember: in equivalence partitioning, you can only test one (1) invalid partition at a time.

Preparing the Deliverables

Follow the steps to accurately complete this assignment:
1.
Complete the test case matrix (Tab 1 at the bottom of the spreadsheet) in SpecificationBasedTesting_Test Case Spreadsheet.xlsx
2. Execute the test cases following one of the three options detailed above
3. Report results in the test case matrix tab (Pass / Fail). Highlight failed test cases in RED.
4. Complete the defect tracking sheet (Tab 2 at the bottom of the spreadsheet) containing the defect number, a brief description of each defect, and the requirement that the defect maps to.

Submission Directions for Project Deliverables

This assignment requires submission of one (1) deliverable: the Excel file containing the SpecificationBasedTesting_Test Case Spreadsheet. When ready, title your file yourlastname_firstname_CSE 565_SpecificationBasedTestingAssignment and submit it to the assignment submission page.

Submission Guidelines

You may submit your deliverable as many times as needed. However, only the most recent submission will be graded. You must submit your assignment file in the designated submission space. Learners may not email or use other means to submit any project for review, including feedback and grading.

Evaluation

Your submission will be evaluated based on the criteria below (worth 80 points total):
1. Set of complete test cases (40 points): included all test cases (passing and failing, as per the requirements)
2. Defects found and described (30 points): identified at least 10 defects and described them
3. Requirements and test cases mapped to defects (10 points): mapped 10 defects
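As a mental model for partitioning the requirements above, here is a Python paraphrase of the input rules and the discount table. The actual program under test is the provided Java jar with seeded defects, so it will not behave exactly like this; the function names are illustrative:

```python
def valid_username(name):
    # Req 1: 5-10 characters, letters only (isalpha also rejects - and _)
    return name.isalpha() and 5 <= len(name) <= 10

def valid_age(age):
    return age >= 18                 # Req 2: boundary at 18

def valid_rating(rating):
    return 1 <= rating <= 10         # Req 7: boundaries at 1 and 10

def discount(status, member, season, category):
    # Req 9: cause-and-effect table for returning users buying Electronics
    if status == "New" or category == "Unknown":
        return 0                     # Reqs 9.1 and 9.2: no discount
    table = {("Summer", "Gold"): 15, ("Spring", "Bronze"): 10,
             ("Winter", "Gold"): 25, ("Fall", "Silver"): 15}
    return table.get((season, member), 0)   # Req 9.7: everything else -> 0
```

Boundary value analysis on this model suggests testing ages 17/18, ratings 0/1/10/11, and usernames of length 4/5/10/11, exercising one invalid partition at a time as the directions require.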
The idea of this project is to use the Android application created in Project 1 to upload handwritten digit pictures to the server and use the server to classify the digits. The digits should be placed in their respective folders after the classification of the images. Project 2 will be a team project. Everyone is expected to submit a video of the application working along with the source code.

Deliverables:

1) Mobile Application: The mobile application will be the same as in Project 1, except that there will not be a category dropdown list for this one.

2) Server Side: This time you will need to train a basic deep learning model from scratch on the MNIST dataset to classify handwritten digits. Once you have trained the deep learning network on the dataset, you will use the trained model to classify the images as they are uploaded from the application. The classified images will then need to be stored in their respective folders on the server. Please download the dataset to train your model. For grading purposes, we will be using photographs of handwritten digits captured at random. Your grade will be proportional to the accuracy of the model.

TASKS:

1. Modify the mobile application
2. Modify the server code to accept the picture without the image category
3.
Decide on an ML pipeline
4. Identify the dimensions of the image
5. Identify the type of deep learning framework to be used
6. Build the deep learning model from scratch
7. Load the MNIST dataset
8. Split the dataset into train and validation sets
9. Preprocess the dataset
10. Train the model using the dataset
11. Validate the trained model
12. Check the weights of the model and perform fine-tuning if required
13. Store the trained model
14. Load the trained model for testing
15. Check the accuracy of the model
16. Modify the server to call the test function
17. Integrate all the components of the server
18. Check that the whole application is integrated seamlessly
19. Provide some random handwritten digits and see the classification
20. Report the accuracy of the classification
21. Check that the classified images are being stored in their respective folders
22. Write up the workings of the application in report format and create a video of the application working
23. Put all the code and the report (in PDF format) in a zip file
24. Submit the zipped file
25. Prepare the application for a seamless working demo

Submission:

1) Source code of both the application and the server.
2) Video of a working demonstration of the application and the server.
3) A 1-2 page report explaining the technical workings of your application.

All of the above need to be zipped together and uploaded to Canvas.

Important Dates for Project 2:

Notes: We will be taking a Zero Tolerance Policy toward plagiarism, so please submit only your original work. Violations of the University's Academic Integrity policy will not be ignored. Penalties include reduced or no credit for submitted work, a failing grade in the class, a note on your official transcript that shows you were punished for cheating, suspension, expulsion, and revocation of already awarded degrees. The university's academic integrity policy can be found at https://provost.asu.edu/academic-integrity.

Thank you
Purpose

Through this project, you will demonstrate your understanding of unit-level testing, your ability to utilize appropriate tools, your proficiency in generating effective unit-level test cases, and your ability to critically analyze the effectiveness of using an AI-powered language model in unit-level testing.

Objectives

Learners will be able to:
● Apply unit-level testing knowledge to develop tests for an algorithm
● Develop unit-level test cases using a generative AI tool and a unit testing framework
● Execute the generated test cases, report on their performance, and further improve them
● Analyze the effectiveness of a generative AI tool in performing unit-level testing

Technology Requirements

● Generative AI tool (e.g. ChatGPT)
● Unit testing framework (of choice)
● IDE (of choice)

Project Description

This assignment is focused on deriving test cases for an algorithm using a unit testing framework.

Directions

You are asked to develop test cases using a unit testing framework of your choice, with feedback from a generative AI tool.

Task 1

You are expected to download or write code implementing the Heapsort algorithm. You can implement the code in Java, C++, Python, or JavaScript.

Task 2

You are expected to:
● research and identify a unit testing framework to test the code created in Task 1
● research and use a generative AI tool to create unit-level test cases using the unit testing framework
● identify and report the prompts used to generate the test cases when interacting with the generative AI tool, and the results generated by the AI tool

Task 3

You are asked to execute the test cases generated in Task 2 in an IDE of your preference and report on the output of the test cases with evidence (i.e. screenshots) showing the test execution and test results.

Task 4

You are asked to assess the validity of the executed test cases and further improve them by updating them.
The test cases developed by the AI tool may not be sufficient to test the code written. Based on the coverage and performance of the test cases, describe how you can update and further improve them.

Task 5

You are expected to make an assessment of the generative AI tool based on your experience using it in this assignment. You are asked to assess the effectiveness of the generative AI tool in creating test cases with the unit-level testing framework. Your assessment should explain your experience with the AI tool and discuss its performance in terms of how well it generates the unit-level test cases.

Submission Directions for Project Deliverables

This assignment requires submission of one (1) deliverable: a report in PDF format including:

CSE 565: Software Verification and Validation – Unit Testing Assignment

1. Development of an algorithm – The report should include a screenshot of the code developed.
2. Explanation of the unit testing framework and prompt generation – The report should describe the unit testing framework and the generative AI tool identified. It should include screenshots of the prompts tried and used for creating unit-level tests of the code.
3. Explanation of the test cases by the AI tool – The report should describe the test cases created by the generative AI tool and include the results generated by the AI tool, with screenshots.
4. Report of the test case execution – The report should describe the results of executing the test cases in the selected IDE. It should include screenshots showing the execution results.
5. Assessment and further improvement of test cases – The report should include an assessment of the validity of the test cases and a discussion of how they can be further improved. It should include screenshots of new test cases developed by the student and results showing the execution of the updated test cases.
6.
An assessment of the generative AI tool – The report should describe your experience using the AI tool in this project and include an assessment of the tool's performance in terms of how well it generates unit-level test cases with the given unit testing framework.

When ready, title your file yourlastname_firstname_CSE 565_UnitTestingAssignment and submit it at the assignment submission page. You are expected to submit:
● a file in PDF format

Submission Guidelines
You may submit your deliverables as many times as needed; however, only the most recent submission will be graded. You must submit your assignment file in the designated submission space. Learners may not email or use other means to submit any project for review, feedback, or grading.

Evaluation
Your assignment will be evaluated based on the following criteria (worth 100 points total):
1. Development of an algorithm (5 points)
2. Explanation of the unit testing framework and prompt generation (10 points)
3. Explanation of the test cases generated by the AI tool (20 points)
4. Report of test case execution (20 points)
5. Assessment and further improvement of test cases (25 points)
6. An assessment of the generative AI tool (20 points)
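Tasks 1 and 2 above ask for a Heapsort implementation plus framework-based unit tests. As one concrete illustration only (the assignment leaves both the language and the framework to your choice), here is a minimal sketch assuming Python and the built-in unittest framework; all names are illustrative:

```python
import random
import unittest


def heapsort(items):
    """Return a new list with the elements of items in ascending order."""
    data = list(items)
    n = len(data)

    def sift_down(start, end):
        # Push data[start] down until the max-heap property holds in [start, end].
        root = start
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child + 1 <= end and data[child] < data[child + 1]:
                child += 1  # pick the larger child
            if data[root] < data[child]:
                data[root], data[child] = data[child], data[root]
                root = child
            else:
                return

    # Build a max-heap, then repeatedly swap the max to the end and re-heapify.
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n - 1)
    for end in range(n - 1, 0, -1):
        data[0], data[end] = data[end], data[0]
        sift_down(0, end - 1)
    return data


class HeapsortTest(unittest.TestCase):
    """The kind of cases a generative AI tool typically proposes (Task 2)."""

    def test_empty_and_single(self):
        self.assertEqual(heapsort([]), [])
        self.assertEqual(heapsort([7]), [7])

    def test_duplicates_and_negatives(self):
        self.assertEqual(heapsort([3, -1, 3, 0]), [-1, 0, 3, 3])

    def test_matches_sorted_on_random_input(self):
        data = [random.randint(-100, 100) for _ in range(200)]
        self.assertEqual(heapsort(data), sorted(data))
```

Running these with `python -m unittest` covers Task 3; Task 4 then means strengthening weak spots the AI-generated suite misses (already-sorted input, reverse-sorted input, very large lists, non-integer keys).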
The idea of this project is to create an Android application that lets users take a picture and upload the image to a server. As already mentioned, this project has two parts: 1) building the Android application with which users can take a picture and select the category of the image, and 2) creating a server that stores the image based on the selected category.

Project 1 will be individual. Everyone is expected to submit a video of the Android application working, along with the source code. This application will be reused for Project 2 and Project 3.

Deliverables:
1) Mobile Application: Develop an Android or iOS application that opens the back camera and lets the user take an image. The application then sends the image to the server along with the category information to store on the server. The application should have at least two pages. Page 1 should ask the user whether they want to take a picture. The second page should ask the user to select the category of the image they just took from a dropdown list. This page should also have an upload button that sends the image and the category information to the server.
2) Server Side: Develop a local server on your computer. It is recommended that you use Flask to build your local server. The server needs to store the images in the respective folder for each image category.

Submission:
1) Source code of both the application and the server.
2) Video demonstrating the working application and server.
3) A 1–2 page report explaining the technical workings of your application.
All of the above need to be zipped together and uploaded to Canvas.

Important Dates for Project 1:

Notes: We will be taking a Zero Tolerance Policy toward plagiarism, so please submit only your original work. Violations of the University Academic Integrity policy will not be ignored.
Penalties include reduced or no credit for submitted work, a failing grade in the class, a note on your official transcript showing you were punished for cheating, suspension, expulsion, and revocation of already awarded degrees. The university's academic integrity policy can be found at https://provost.asu.edu/academic-integrity.

Thank You
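The server side described in the deliverables above could be sketched minimally with Flask. This is an assumption-laden sketch, not a required design: the `/upload` endpoint name, the `image` and `category` form fields, and the `uploads/` folder are all illustrative choices your app and server merely need to agree on.

```python
import os

from flask import Flask, request

app = Flask(__name__)
UPLOAD_ROOT = "uploads"  # images are stored under uploads/<category>/


@app.route("/upload", methods=["POST"])
def upload():
    # The mobile app is assumed to send the photo as multipart form data
    # together with a "category" form field chosen from the dropdown.
    image = request.files.get("image")
    category = request.form.get("category", "uncategorized")
    if image is None or image.filename == "":
        return {"error": "no image provided"}, 400

    # Create the per-category folder on first use, then save the file into it.
    folder = os.path.join(UPLOAD_ROOT, category)
    os.makedirs(folder, exist_ok=True)
    image.save(os.path.join(folder, image.filename))
    return {"status": "stored", "category": category}, 201
```

Run with `flask run --host 0.0.0.0` so the phone can reach your computer over the local network; the app then POSTs to `http://<your-laptop-ip>:5000/upload`.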
Purpose
In this project, you will extract several performance metrics of an Artificial Pancreas system from sensor data.

Objectives
Learners will be able to:
● Extract feature data from a data set.
● Synchronize data from two sensors.
● Compute and report overall statistical measures from data.

Technology Requirements
● Python 3.11
● scikit-learn 1.3.2
● pandas 1.5.3
● numpy 1.26.3
● scipy 1.11.4

Project Description
In this project, we consider the Artificial Pancreas medical control system, specifically the Medtronic 670G system. The Medtronic system consists of a continuous glucose monitor (CGM) and the Guardian Sensor (12), which is used to collect blood glucose measurements every 5 minutes. The sensor is single-use and can be used continuously for 7 days, after which it has to be replaced. The replacement procedure includes a recalibration process that requires the user to obtain blood glucose measurements using a Contour NextLink 2.4 glucose meter®. Note that this process also requires manual intervention.

The Guardian Link Transmitter® powers the CGM sensor and sends the data to the MiniMed 670G® insulin pump. The insulin pump utilizes SmartGuard Technology, which modulates insulin delivery based on the CGM data. The SmartGuard Technology uses a Proportional, Integral, and Derivative (PID) controller to derive small bursts of insulin, also called a micro bolus, to be delivered to the user. During meals, the user uses a Bolus Wizard to compute the amount of food bolus required to maintain blood glucose levels. The user manually estimates the amount of carbohydrate intake and enters it into the Bolus Wizard. The Bolus Wizard is pre-configured with the correction factor, body weight, and average insulin sensitivity of the subject, and it calculates the bolus insulin to be delivered. The user can then program the MiniMed 670G infusion pump to deliver that amount. In addition to the meal bolus, the MiniMed 670G insulin pump can also provide a correction bolus.
The correction bolus amount is provided only if the CGM reading is above a threshold (typically 120 mg/dL), and it is proportional to the difference between the CGM reading and the threshold. SmartGuard has two methods of suspending insulin delivery: a) suspend on low, where insulin delivery is stopped when the CGM reading is less than a certain threshold, or b) suspend on predicted low, where insulin delivery is stopped when the CGM reading is predicted to be less than a certain threshold. Apart from these options, insulin delivery can also be suspended manually by the user, or automatically when the insulin reservoir is running low.

Accessing Ed Lessons
You will complete and submit your work through Ed Lessons. Follow the directions to correctly access the provided workspace:
1. Go to the Canvas Assignment, "Submission: Extracting Time Series Properties of Glucose Levels in Artificial Pancreas Project"
2. Click the "Load Submission…in new window" button.
3. Once in Ed Lessons, select the assignment titled "Submission: Extracting Time Series Properties of Glucose Levels in Artificial Pancreas Project".
4. In the code challenge, first review the directions and resources provided in the description.
5. When ready, start working in the Python file titled "main.py"

Directions
Dataset: You will be given two datasets:
1. from the continuous glucose sensor (CGMData.csv), and
2. from the insulin pump (InsulinData.csv)

The output of the CGM sensor consists of three columns:
1. data time stamp (Columns B and C combined),
2. the 5-minute filtered CGM reading in mg/dL (Column AE), and
3. the ISIG value, which is the raw sensor output every 5 minutes.

The output of the pump has the following information:
1. data time stamp,
2. basal setting,
3. micro bolus every 5 minutes,
4. meal intake amount in terms of grams of carbohydrate,
5. meal bolus,
6. correction bolus,
7. correction factor,
8.
CGM calibration or insulin reservoir-related alarms, and
9. auto mode exit events and unique codes representing reasons (Column Q).

The bold items are the columns you will use in this project.

Metrics to be extracted:
1. percentage time in hyperglycemia (CGM > 180 mg/dL),
2. percentage time in hyperglycemia critical (CGM > 250 mg/dL),
3. percentage time in range (CGM >= 70 mg/dL and CGM <= 180 mg/dL),
4. percentage time in range secondary (CGM >= 70 mg/dL and CGM <= 150 mg/dL)
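As a rough illustration of the kind of computation these metrics involve, here is a hedged pandas sketch. It assumes the CGM readings (Column AE of CGMData.csv) have already been parsed into a pandas Series for one day, and uses the 288 expected 5-minute samples per day as the denominator; the helper name and the sample values are illustrative, not part of the project spec:

```python
import pandas as pd


def percent_time(cgm, condition, samples_per_day=288):
    """Percentage of a day's expected 5-minute samples satisfying a condition."""
    return 100.0 * condition(cgm).sum() / samples_per_day


# Illustrative one-day series of CGM readings in mg/dL (288 samples);
# real values come from the 5-minute filtered CGM column of CGMData.csv.
cgm = pd.Series([90, 130, 185, 260, 65, 150] * 48)

hyper = percent_time(cgm, lambda s: s > 180)           # hyperglycemia
hyper_critical = percent_time(cgm, lambda s: s > 250)  # critical hyperglycemia
in_range = percent_time(cgm, lambda s: (s >= 70) & (s <= 180))

print(hyper, hyper_critical, in_range)
```

In the full project you would compute such percentages over the segments the directions specify and then report the overall statistical measures, after synchronizing the CGM and pump time stamps.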
Purpose
In this project, you will apply cluster validation techniques to data extracted from a provided data set.

Objectives
Learners will be able to:
● Develop code that performs clustering.
● Test and analyze the results of the clustering code.
● Assess the accuracy of the clustering using SSE and supervised cluster validity metrics.

Technology Requirements
● Python 3.11
● scikit-learn 1.3.2
● pandas 1.5.3
● numpy 1.26.3
● scipy 1.11.4
● matplotlib 3.8.2

Project Description
For this project, you will write a Python program that takes a dataset and performs clustering. Using the provided training data set, you will perform cluster validation to determine the amount of carbohydrates in each meal.

Please watch the Cluster Validation Project introductory video before beginning. It is located in Ed Lessons before the project's code challenge. Note: project details in the Overview Document were updated after the videos were recorded, so some directions or items may not match. Follow the Overview Document directions to complete your project correctly.

Directions
Accessing Ed Lessons
You will complete and submit your work through Ed Lessons. Follow the directions to correctly access the provided workspace:
1. Go to the Canvas Assignment, "Submission: Cluster Validation Project"
2. Click the "Load Submission…in new window" button.
3. Once in Ed Lessons, select the assignment titled "Submission: Cluster Validation Project".
4. In the code challenge, first review the directions and resources provided in the description.
5. When ready, start working in the Python file "main.py"

Project Directions
There are two main parts to the process:
1. Extract features from meal data
2. Cluster meal data based on the amount of carbohydrates in each meal

Data: Use the Project 1 data files:
● CGMData.csv
● InsulinData.csv

Step 1: Extracting Ground Truth:
1.
From InsulinData.csv, take column Y (BWZ Carb Input (grams)), get all the meal intake data, and derive the min and max values.
2. Discretize the meal amount into bins of size 20. In total, you should have n = (max − min)/20 bins.
3. Consider each row in the Meal Data Matrix (P x 30) that you generated in Project 2.
4. Put the rows in the respective bins according to their meal amount label. This will be your ground truth.

Step 3: Performing clustering:
Use the features from your Project 2 to cluster the meal data into n clusters. Use DBSCAN and KMeans.

Step 4: Compute SSE:
For each of the clusters, compute SSE values and combine them to get one SSE value each for KMeans and DBSCAN.

Step 5: Calculate the Entropy and Purity:
1. Create two matrices, one for KMeans and one for DBSCAN, with bins (b1, b2, …, bn) as the columns and clusters (C1, C2, …, Cn) from KMeans and DBSCAN as the rows.
2. Populate each matrix by counting the clustered points that fall in the respective bins.
3. Calculate the entropy and purity using the formulas provided in the video.

Expected Output:
A Result.csv file that contains a 1 x 6 vector in the following format:
SSE for KMeans | SSE for DBSCAN | Entropy for KMeans | Entropy for DBSCAN | Purity for KMeans | Purity for DBSCAN
The Result.csv file should not have any headers, just the six values in six columns.

Submission Directions for Project Deliverables
This project will be auto-graded. You must complete and submit your work through Ed Lessons' code challenges to receive credit for the course:
1. To get started, use the "main.py" file provided in your workspace.
2. All necessary datasets are already loaded into the workspace.
3. Execute your code by running the "python3 main.py" command in the terminal to test your work.
4. When you are ready to submit your completed work, click "Test" at the bottom right of the screen.
5. After testing your work, submit it for auto-grading by clicking the "Test" button.
6.
You will know you have completed the assignment when feedback appears for each test case with a score.
7. If needed, to resubmit the assignment in Ed Lessons:
a. Edit your work in the provided workspace
b. Execute your code again by running the commands in the terminal
c. Click "Test" at the bottom of the screen
8. Once you have finished working on the project, submit it by clicking the "Submit" button at the top right corner of your submission space.

Your submission will be reviewed by the course team, and after the due date has passed, your score will be populated from Ed Lessons into your Canvas grade.

Note:
1. Do not change the code file name; it must remain "main.py" for the auto-grader to recognize your submission.
2. When the auto-grader runs your Python file, it should generate a "Result.csv" file with the specified format. The "Result.csv" file should not include any headers and should only contain the metrics in a 1 x 6 matrix.
3. Avoid using absolute paths when accessing other files.
4. Before submitting to the grader, it is recommended to run your code file via the terminal to catch any potential runtime errors.

Evaluation
The autograder will evaluate your code based on the following criteria:
● 100 points: For developing code in Python that takes the dataset and performs clustering.
● 40 points: For developing code in Python that implements a function to compute the SSE, entropy, and purity metrics. These two can be written in the same file.
● 60 points: For the supervised cluster validation results obtained by your code.
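Assuming the ground-truth bins and the cluster labels are available as integer arrays, the SSE, entropy, and purity computations described in Steps 4–5 can be sketched as follows (a minimal illustration with made-up data; the formulas in the course video remain the authority):

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_sse(features, labels):
    """Sum of squared distances of every point to its cluster centroid."""
    sse = 0.0
    for c in np.unique(labels):
        pts = features[labels == c]
        sse += ((pts - pts.mean(axis=0)) ** 2).sum()
    return sse


def entropy_purity(bins, labels):
    """Weighted entropy and purity from a cluster-by-bin contingency matrix."""
    clusters, bin_ids = np.unique(labels), np.unique(bins)
    total, entropy, purity = len(labels), 0.0, 0.0
    for c in clusters:
        # One row of the matrix: how many points of cluster c land in each bin.
        counts = np.array([np.sum((labels == c) & (bins == b)) for b in bin_ids])
        size = counts.sum()
        p = counts[counts > 0] / size
        entropy += (size / total) * -(p * np.log2(p)).sum()
        purity += (size / total) * (counts.max() / size)
    return entropy, purity


# Tiny illustrative example: 6 points, 2 ground-truth bins, KMeans with n=2.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
true_bins = np.array([0, 0, 0, 1, 1, 1])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
ent, pur = entropy_purity(true_bins, km.labels_)
print(cluster_sse(X, km.labels_), ent, pur)
```

For DBSCAN you would additionally decide how to treat noise points (label −1) before building the contingency matrix; that design choice is yours, not fixed by this sketch.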
Battleboat is a probability-based board game that challenges the user to locate enemy boats hidden on a rectangular grid. The goal of the game is to locate and destroy every enemy boat in the fewest number of guesses. You will be modeling this game in Java.

You must implement every part of the description below. This means you must follow all specified modifiers and signatures exactly as they are given. Examples of changes that could lose you points: changing a variable to public when it is listed as private, changing the return type of a function, etc. Your code will also be judged on style. This should not be a stressful requirement – it simply means that you must create logically organized, well-named, and well-commented code so the TAs can grade it.

IMPORTANT: You cannot import anything to help you complete this project. The only exception is importing Scanner to handle the I/O. Note that you do not have to explicitly import Math because Java already does this for you; in other words, you can use Math methods without importing Math.

Note: Writing useful helper methods for the required functions is allowed and encouraged! Remember, part of your grade comes from the style and readability of the code you write.

Note: You will find it useful to write your own tests to prove to yourself that your code works. Please include any test classes you write with your submission.

1 Cell Class
The Battleboat game board is composed of many cells. You are required to create a Cell class that contains the following private attributes:
• private int row: indicates the row value of the Cell
• private int col: indicates the column value of the Cell
• private char status: character indicating the status of the Cell. There are four different possibilities for this field:
Character | Conditions
' ' (space) | Has not been guessed, no boat present
'B' | Has not been guessed, boat present
'H' | Has been guessed, boat present
'M' | Has been guessed, no boat present

In addition, you are required to implement the following functions:
• public char get_status(): getter method for the status attribute
• public void set_status(char c): setter method for the status attribute
• public Cell(int row, int col, char status): Cell class constructor

2 Battleboat Class
A Battleboat object will have the following attributes:
• private int size: indicates the number of Cell objects a Battleboat spans. Default this value to 3
• private boolean orientation: indicates the orientation of the Battleboat (horizontal or vertical; can be randomly decided)
• private Cell[] spaces: array of the Cell objects associated with the Battleboat

And the following functions:
• public boolean get_orientation(): getter method for the orientation attribute
• public int get_size(): getter method for the size attribute
• public Cell[] get_spaces(): getter method for the spaces attribute
• public Battleboat(): Battleboat class constructor

Hint: To generate random numbers, the Math.random() method can be used. However, this method returns a double in the range 0 to 1, so we need to scale it and then round it down to a whole number. To do this, use the Math.floor(x) function, which takes a double x and rounds it down to the nearest integer. For example, Math.floor(2.9) is 2.

3 Board Class
The computer will simulate a rectangular m × n board.
A Board object will contain the following:
• private int num_rows: indicates the number of rows a Board has
• private int num_columns: indicates the number of columns a Board has
• private int num_boats: indicates the number of Battleboat objects a Board has
• private Battleboat[] boats: array of all the Battleboat objects associated with a Board object
• private Cell[][] board: 2-dimensional Cell array required to represent the Board
• private boolean debugMode: flag to indicate if the Board should be printed in debug mode

The minimum Board size is 3 × 3 and the maximum is 12 × 12. Assume that the points in the Board range from (0, 0) to (m − 1, n − 1) inclusive. Each boat is represented by a line of consecutive squares on the board. Boats may not overlap other boats, extend outside the game board, or be placed diagonally. They may be horizontal or vertical. A boat is considered "sunk" when all the squares of the boat have been "hit" by the user.

Examples: Valid coordinates for a boat of size 3 are {(0, 0), (0, 1), (0, 2)} and {(1, 1), (2, 1), (3, 1)}. Examples of invalid coordinates: {(0, 0), (0, 1)} is invalid because there are not enough points; {(0, 0), (0, 2), (0, 3)} is invalid because the points are not consecutive; {(−1, 0), (0, 0), (1, 0)} is invalid because the first coordinate is out of bounds. Finally, two boats cannot contain the same point because they cannot overlap.

After the game board has been sized, the program should place boats randomly on the board. This requires randomly generating a coordinate (x, y) where the boat will be placed, as well as randomly choosing whether the boat should be horizontal or vertical. The quantity of boats is defined by the width and height of the game board. As mentioned before, all boats should be of length 3. Recall that the smallest Board is 3 × 3 and the largest is 12 × 12.

Smallest Dimension | Boat Quantity
width == 3 or height == 3 | 1
3 < width
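The assignment itself must be written in Java, but the random-placement logic described above is language-independent. A hedged sketch in Python of generating one valid length-3 boat (function and variable names are illustrative, not part of the required Java signatures) could look like:

```python
import random

BOAT_LEN = 3


def boat_cells(num_rows, num_cols, taken):
    """Randomly pick coordinates for one boat that fits and overlaps nothing.

    taken is a set of (row, col) cells already occupied by other boats.
    """
    while True:
        horizontal = random.random() < 0.5  # like Math.random() < 0.5 in Java
        if horizontal:
            # Scale the [0, 1) random value and floor it, as the Math.floor
            # hint suggests; clamp the start column so the boat stays in bounds.
            row = int(random.random() * num_rows)
            col = int(random.random() * (num_cols - BOAT_LEN + 1))
            cells = [(row, col + i) for i in range(BOAT_LEN)]
        else:
            row = int(random.random() * (num_rows - BOAT_LEN + 1))
            col = int(random.random() * num_cols)
            cells = [(row + i, col) for i in range(BOAT_LEN)]
        if not any(c in taken for c in cells):
            return cells  # consecutive, in-bounds, non-overlapping


print(boat_cells(3, 3, set()))
```

The retry loop is one simple way to satisfy the no-overlap rule; in Java the same idea becomes a loop around Math.floor(Math.random() * range) plus a collision check against the already-placed boats.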
The problem. The slider puzzle is played on a 3-by-3 grid with 8 square blocks labeled 1 through 8 and one blank. Your goal is to rearrange the blocks so that they are in order. You are permitted to slide blocks horizontally or vertically into the blank spot. The following shows a sequence of legal moves from an initial board position (left) to the goal position (right) by first moving 5 UP into the blank spot and then moving 8 LEFT into the blank spot.

initial … goal

If you want to try your skills, go to: https://mypuzzle.org/sliding.

Part 1: How to Make Moves
Use the code for the board which is provided (or code of your own). Write the "ShowMe" method, which takes a board and a sequence of moves and shows the boards that result from those moves.

Sample Output:
Original Board:
1 2 3
4 6 8
7 0 5
Moves: LDRUL
L==>
1 2 3
4 6 8
7 5 0
D==>
1 2 3
4 6 0
7 5 8
R==>
1 2 3
4 0 6
7 5 8
U==>
1 2 3
4 5 6
7 0 8
L==>
1 2 3
4 5 6
7 8 0

Part 2: Solving the Problem
For this problem, you are writing a program to SOLVE the puzzle, not to play it. From a specific board, you are to illustrate an efficient solution. The file Prog1OUT.txt shows an example output.

The Intelligence
One way: You could just randomly try moves, each time checking whether you have found a solution. While this would likely work eventually, it could take a while.
Our way: We are going to be more methodical. Instead of randomly picking a particular move, we say, "I have four (or fewer) possible moves. I don't know which one to try. I want to try ALL of them. Let me try each of them and then (later) pick the one that was the best." We try all choices from each board we reach. We use a queue to remember all of the partial games we are considering.

A logical tree helps us consider the possibilities. This is a LOGICAL tree, not a coded tree. Starting at "Game 0" you can make one of four moves (UP, DOWN, LEFT, RIGHT). [We would say this is a tree with a four-way branching factor.]
We can keep expanding nodes until a "leaf" is the desired (goal) game. That represents an exhaustive search. However, we haven't said in which order we expand the nodes. Are we going to keep going deeper in the tree [a depth-first search]? This has the advantage of being easy to do (as recursion takes care of returning to a previous node), but it has some disadvantages.

We are going to approach it slightly differently. Instead of doing one move after another, we are going to consider all boards reachable with one move, then all boards reachable with two moves, etc. So in the picture below, we consider Game 0, then Game 1a, Game 1b, Game 1c, Game 1d, and then all the games at the next level. While some games have four "next games", others have fewer (due to the position of the blank and because we don't want to undo the previous move). This is like traversing the logical game tree "by level", which is called a breadth-first traversal. We examine all one-move states, then all two-move states, etc.

The bad news is that recursion can't help us with this kind of traversal. You will use a queue to store all the possibilities you want to explore. In order to refresh your pointer skills, you must use a linked-list representation of a queue. This needs to be code that you have written.

Below is an example of the boards I would consider. The original game is the root. Notice that I eliminate moves which take me immediately back to the previous position. Can you explain why? Study the example below by marking each arrow with the move that was taken to get to the next board.

Because we try lots of possibilities, we need to keep all the games we are working on. We define a state of the game to be the board and the history of how we got to the board. First, insert the initial state into a queue. Then, delete the state from the queue, and insert onto the queue all neighboring states (those that can be reached in one move) of the removed state.
Repeat this procedure until you reach the goal state.

Input
Allow the user to either:
a. Create a random board. The user will select the number of times to jumble a board (as this controls how difficult the board will be to solve). Make sure the game is legal.
b. Solve a specific board.
Your code should work correctly for arbitrary N-by-N boards (for any 1 ≤ N ≤ 100), even if it is too slow to solve some of them in a reasonable amount of time.

Output
Use a file similar to SliderProject.cpp to generate test cases. The file Prog1OUT.txt illustrates the format of the output. For each puzzle:
1. Show the series of moves needed to solve the puzzle.
2. Show the number of boards placed on the queue and the number of boards removed from the queue.

Hints
Use of assert
One of the biggest problems we have with pointers is following NULL pointers. You will be amazed at how many problems are solved by being cautious. Before you follow a pointer, always check to see if it is NULL. (In fact, many of the base cases in recursion are simply checks for a NULL pointer.) If you are absolutely sure the pointer cannot be NULL, I would still recommend checking it via an assert statement. To use assert: #include <cassert>. If you think something is true, you simply state assert(statement); If the statement evaluates to true, nothing happens. If it doesn't, you abort in a dramatic way (that can't be missed).

Use of stringstream
For each data structure, I create a "toString" function which gives a printable version of the data structure. This is better than just using cout, as I can easily write to a different location. Since the only way I print the contents of a data structure is by calling toString, it is easy to modify the view I want to see. A stringstream allows you to use the power of the input/output operators to create strings for you to use. The following example shows how I create a string version of the Board.
// Create a printable representation of the Board class
string Board::toString() {
    stringstream ss;
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++)
            ss << board[i][j] << " ";
        ss << endl;
    }
    return ss.str();
}

int width(TreeNode * t)
// pre: t is a binary tree
// post: returns the width of t
{
    if (t == nullptr) return 0;
    int leftW = width(t->left);
    int rightW = width(t->right);
    int rootW = height(t->left) + height(t->right) + 1;
    return max(rootW, max(leftW, rightW));
}

However, the function as written does not run in O(n) time because computing height is O(n) work. Write a version of width that runs in O(n) time. Hint: use a function as described below.

int widthHeight(TreeNode * t, int & height)
// pre: t is a binary tree
// post: return (via reference param) height = height of t
//       return as value of function: width of t
{}

Program 2, Page 4

9. (2 points) Two binary trees s and t are isomorphic if they have the same shape; the values stored in the nodes do not affect whether two trees are isomorphic. In the diagram below, tree a and tree b are isomorphic. Tree c is not isomorphic to the others.

10. (2 points) Two trees s and t are quasi-isomorphic if s can be transformed into t by swapping the left and right children of some of the nodes of s. The values in the nodes are not important in determining quasi-isomorphism; only the shape is important. For the trees below, tree a and tree b are quasi-isomorphic because if the children of node 10 in tree a are swapped, a tree having the same shape as tree b is obtained. All three trees below are quasi-isomorphic. Write a function isQuasiIsomorphic that returns true if two trees are quasi-isomorphic.

11. (2 points) Write a function which returns the least common ancestor of two nodes in a binary search tree. A least common ancestor is an ancestor of both nodes and is closest to the nodes. A node is considered to be an ancestor of itself. In the tree below, the least common ancestor of 82 and 8 is 20. The least common ancestor of 42 and 50 is 50 (even though 42 doesn't exist). The least common ancestor of 57 and 40 is 50.

PART 1
You have been given the AVL tree code from your author as a starting point. Reading code is a valuable experience.
On the job, you are often asked to modify/maintain existing code. You can't start over and do it your way; you must incorporate your changes into the existing code. You are expected to understand the code that has been given to you. Make the following changes:
1. Change all variable names to meaningful names.
2. Make any changes needed to conform to our style guidelines.
3. Write a toString function which creates an indented version of the tree (as in program 2).
4. Implement the removeMin function. Remove the smallest node in the tree and rebalance. This function is to be your own work, not copied from ANY other source.

The data stored in the AVL tree is to be generic and work for either an integer or a GameState (of the slider-puzzle variety). Illustrate that the AVL tree is working properly by printing the tree (using toString) after each of the following groups of operations:
• Add: 1 3 5 7 9 11 2 4 8 (now print tree)
• Remove: 7 9 (now print tree)
• Add: 50 30 15 18 (now print tree)
• Remove Min (now print tree)
• Remove Min (now print tree)
• Remove Min (now print tree)
• Add: 17 (now print tree)

PART 2
Use the AVL tree as a priority queue to improve your solution to the slider puzzle problem (of program 1). This is the idea: instead of looking at solutions in the regular order (which we used in program 1), what if you considered partial solutions which look closer? For example, which of the following is the closest?

A priority queue is a queue such that insertions are made in any order, but when you remove a node from the queue, it is the one with the best priority. Our priority will be an estimate of the total amount of work needed to solve the problem – so a lower score is preferred. Thus, we will first consider boards that "look better" to us. In our previous solution (program 1), we considered all one-move solutions before considering all two-move solutions, and so on.
If we are smarter about which solutions we consider, we can improve the functioning of the solution. In this assignment, you will compare your previous solution to this new technique. We define a state of the game to contain at least:
• board
• cost so far: the number of moves taken from the initial state to reach the current board
• estimated cost to reach a solution
• priority (cost so far + estimated cost to reach a solution)
• a string of moves indicating how the current board was reached

Best-first search. We describe a solution to the problem that illustrates a general artificial intelligence methodology known as the A* search algorithm. First, insert the initial state into a priority queue. Then, delete from the priority queue the state with the smallest priority, and insert onto the priority queue all neighboring states (those that can be reached in one more move), using an appropriate estimated cost for each. Repeat this procedure until the state removed from the priority queue is the goal state.

The success of this approach hinges on the choice of the estimated cost or priority function. You can compute your "expected work" function any way you want, as long as it is reasonable and underestimates the real cost. Be creative. Here are two choices:

Manhattan distance: For each number (other than the blank), give each entry a score showing how far it needs to be slid to get to the right location. (This is termed a Manhattan distance after the grid-like street geography of Manhattan.)

Hamming distance: The number of tiles in the wrong position. Intuitively, a search node with a small number of tiles in the wrong position is close to the goal.

The diagram below shows how the Manhattan distance could be used to aid the search. We make a key observation: to solve the puzzle from a given state on the priority queue, the total number of moves required (including those already made) is at least its priority, using our distance function.
Consequently, as soon as we dequeue a board which has the goal state, we have not only discovered a sequence of moves from the initial board to the board associated with the state, but one that makes the fewest number of moves. When we dequeue, we know that everything else we will enqueue in the future will take at least as many moves to reach the goal. This is true because our distance is an UNDER-estimate of the number of moves required.

Output: Since we want to compare this version with our brute-force solution from program 1, you will need to have both methods working. Create a class "SliderGame" which allows you to call either "bruteForceSolve" or "aStarSolve" on the same board. Each method should keep track of the total number of states that were pulled off the queue.
1. Show the output for the four boards shown above under the "Part 2" heading, using both methods (brute force and A*). Your output should include:
a. which method was used
b. the initial board
c. the sequence of moves used to solve the board (e.g., URRDLL)
d. the "show me" sequence, which shows what the board looks like after each move
e. the total number of boards dequeued from the queue
2. Find an example for which your "more intelligent" method saves significant time over the brute-force method of program 1. Show the same output for this board.

Hints
During debugging, when using the A* search, show each board as you pull it off the queue. Be sure to print all the data in the board state (history, priority, expected moves remaining). Print output both to the console and to a file; then you can easily remove the printing to the console (to make the program run faster and the output easier to read).

Getting the same code to work for ints and GameStates forces us to be more methodical in our approach. For example, the toString function looks something like:

string toString(AvlNode *t, string indent) const {
    stringstream ss;
    if (t != nullptr) {
        ss << toString(t->right, indent + "  ");
        ss << indent << t->element << endl;
        ss << toString(t->left, indent + "  ");
    }
    return ss.str();
}
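The best-first procedure described above (state = board + cost so far + heuristic; dequeue by cost + heuristic) can be sketched compactly. This is an illustrative Python translation with heapq standing in for the required AVL-tree priority queue, not the C++ implementation the assignment asks for, and it returns only the optimal move count (the assignment also requires the move history and dequeue counts):

```python
import heapq


def manhattan(board, n):
    """Sum over tiles of the grid distance from each tile to its goal cell."""
    dist = 0
    for idx, tile in enumerate(board):
        if tile != 0:
            goal = tile - 1
            dist += abs(idx // n - goal // n) + abs(idx % n - goal % n)
    return dist


def neighbors(board, n):
    """Boards reachable by sliding one tile into the blank."""
    blank = board.index(0)
    r, c = divmod(blank, n)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < n and 0 <= nc < n:
            nb = list(board)
            nb[blank], nb[nr * n + nc] = nb[nr * n + nc], nb[blank]
            yield tuple(nb)


def a_star_solve(start, n):
    """Return the minimum number of moves to reach the goal (A* search)."""
    goal = tuple(list(range(1, n * n)) + [0])
    frontier = [(manhattan(start, n), 0, start)]  # (priority, cost, board)
    best = {start: 0}
    while frontier:
        _, cost, board = heapq.heappop(frontier)
        if board == goal:
            return cost  # first goal dequeued is optimal: heuristic underestimates
        for nb in neighbors(board, n):
            if nb not in best or cost + 1 < best[nb]:
                best[nb] = cost + 1
                heapq.heappush(frontier, (cost + 1 + manhattan(nb, n), cost + 1, nb))
    return -1  # unreachable (half of all permutations are unsolvable)


# The Part 1 sample board, solved by LDRUL in 5 moves.
print(a_star_solve((1, 2, 3, 4, 6, 8, 7, 0, 5), 3))
```

The same skeleton with a plain FIFO queue (and a move-history string carried in each state) gives the brute-force breadth-first solver of program 1, which makes the dequeue-count comparison between the two methods straightforward.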
Overview
In Project 1, you will design a relational database for storing information about your Fakebook social network. You will begin with a detailed description of the content. Then, you will need to systematically go through the conceptual and logical database design process you learned about in class. You can do the project either alone or in a group of two. If working in a group, a single submission is required.

Part 1: ER Design
You should do this part to derive constraints for the relational model and to make sure you can go from requirements, to an ER model, to a relational model. As a starting point, we have done the initial "requirements analysis" for you. The following is a brief description of the data that you will store in your database. (In real life, you would probably begin with much fuzzier information.) All IDs in the specs below are, of course, unique.

User Information: There can be an unlimited number of users. Each user has the following information:
● Profile information. This includes the following attributes: user ID, first name, last name, year of birth, month of birth, day of birth, gender.
● Hometown Location. A user's hometown includes the following attributes: city, state, country.
● Current Location. Exactly the same attributes as hometown location.
● Education History. A user's educational history contains information on each college program attended, if any, with each college program containing the following attributes: name of the institution (e.g., University of Michigan), year of graduation, concentration (e.g., CS, EE, etc.), and degree (e.g., BS, MS, PhD, etc.).
● Friendship information. Each user can have any number of friends. Each friend must also be a Fakebook user.

Photos: "Photos" is an important Fakebook application. It records the following information:
● Album information. Each photo MUST belong to exactly one album.
An album has the following attributes: album_ID, owner_ID (this refers to the owner's Fakebook ID), album_name, cover_photo_ID (this refers to a photo ID), album_created_time, album_modified_time, album_link, and album_visibility.
● Other information. Each photo has the following attributes: photo_ID, photo_caption, photo_created_time, photo_modified_time, and photo_link.
● Photo Tags. Users can also interact by tagging each other. A photo tag identifies a Fakebook user in a photo. It has the following associated attributes: tag_photo_id (a Fakebook photo ID), tag_subject_id (a Fakebook user ID), tag_x_coordinate, tag_y_coordinate, and tag_created_time. The database does not track who did the tagging. Note that there can be multiple tags at exactly the same (x, y) location. However, there can be only ONE tag for each subject in a photo; Fakebook doesn't allow multiple tags for the same subject in a single photo. For example, you cannot tag Lady Gaga twice in a photo, even if she appears to be at two separate locations in the photo.

Messages: Users can also send private messages to each other.
● Message information: sender_ID (a Fakebook user ID), receiver_id (a Fakebook user ID), message_content (the text of the message), and sent_time. In this version of Fakebook, there are no group messages. A user can, of course, send zero or more messages to different users.

Events: "Events" is another useful Fakebook feature.
● Basic event information: event_ID, event_creator_id (the Fakebook user who created the event), event_name, event_tagline, event_description, event_host (this is a string, not a Fakebook user), event_type, event_subtype, event_location, event_city, event_state, event_country, event_start_time, and event_end_time.
● Event participants: Participants in an event must be Fakebook users. Each participant must have a confirmation status value (attending, declined, unsure, or not-replied).
The sample data does not have information on Event Participants, so you can leave the information on Participants empty.

Task for Part 1
Your task in Part 1 is to perform "Conceptual Database Design" using ER diagrams. There are many ER variants, but for this project, we expect you to use the conventions from the textbook and lecture.

Hints for Part 1
You need to identify the entity sets and relationship sets in a reasonable way. We expect there to be multiple correct solutions; ER design is somewhat subjective. Your goal should be to capture the given information using the ER constructs that you have learned about in class (participation constraints, key constraints, weak entities, ISA hierarchies, and aggregation) as necessary. For the entity set, relationship set, and attribute names, you can use the ones we have provided here, or you may choose your own names, as long as they are intuitive and unambiguous. Before you get started, you should also read the Appendix to understand the specifics of the data. Some of the ER diagram constraints are given in the Appendix.

Part 2: Logical Database Design
For the second part of the project, your task is to convert your ER diagrams into relational tables. You are required to write SQL DDL statements for this part. You should turn in two files:
1. createTables.sql
2. dropTables.sql
As a starting point, we are giving you a set of tables, along with some columns. Your design must use these tables, but you will need to add integrity constraints so that the schema comes as close to enforcing the requirements as is practical. You can add additional columns as well. Use the most appropriate types for the fields.
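One concrete example of a requirement that DDL constraints can capture is the one-tag-per-subject-per-photo rule described earlier. The following is a hypothetical sketch, not a required answer (the course targets Oracle; SQLite happens to accept the same type names, so it is used here only to exercise the statements):

```python
import sqlite3

# A composite primary key captures "at most one tag per subject per photo".
# (x, y) is deliberately NOT part of the key, since duplicate tag locations
# within a photo are allowed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TAGS (
    TAG_PHOTO_ID     VARCHAR2(100) NOT NULL,
    TAG_SUBJECT_ID   NUMBER NOT NULL,
    TAG_CREATED_TIME TIMESTAMP,
    TAG_X            NUMBER,
    TAG_Y            NUMBER,
    PRIMARY KEY (TAG_PHOTO_ID, TAG_SUBJECT_ID)
);
INSERT INTO TAGS VALUES ('p1', 7, '2016-01-01 00:00:00', 10, 20);
-- Same (x, y) in the same photo, but a different subject: allowed.
INSERT INTO TAGS VALUES ('p1', 8, '2016-01-01 00:00:00', 10, 20);
""")

try:
    # Tagging subject 7 in photo 'p1' a second time must be rejected.
    conn.execute("INSERT INTO TAGS VALUES ('p1', 7, '2016-01-02', 1, 2)")
    second_tag_rejected = False
except sqlite3.IntegrityError:
    second_tag_rejected = True
```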
The required tables and their schemas are given below:

USERS:
USER_ID (NUMBER)
FIRST_NAME (VARCHAR2(100))
LAST_NAME (VARCHAR2(100))
YEAR_OF_BIRTH (INTEGER)
MONTH_OF_BIRTH (INTEGER)
DAY_OF_BIRTH (INTEGER)
GENDER (VARCHAR2(100))

FRIENDS:
USER1_ID (NUMBER)
USER2_ID (NUMBER)

CITIES:
CITY_ID (INTEGER)
CITY_NAME (VARCHAR2(100))
STATE_NAME (VARCHAR2(100))
COUNTRY_NAME (VARCHAR2(100))

USER_CURRENT_CITY:
USER_ID (NUMBER)
CURRENT_CITY_ID (INTEGER)

USER_HOMETOWN_CITY:
USER_ID (NUMBER)
HOMETOWN_CITY_ID (INTEGER)

MESSAGE:
MESSAGE_ID (INTEGER)
SENDER_ID (NUMBER)
RECEIVER_ID (NUMBER)
MESSAGE_CONTENT (VARCHAR2(2000))
SENT_TIME (TIMESTAMP)

PROGRAMS:
PROGRAM_ID (INTEGER)
INSTITUTION (VARCHAR2(100))
CONCENTRATION (VARCHAR2(100))
DEGREE (VARCHAR2(100))

EDUCATION:
USER_ID (NUMBER)
PROGRAM_ID (INTEGER)
PROGRAM_YEAR (INTEGER)

USER_EVENTS:
EVENT_ID (NUMBER)
EVENT_CREATOR_ID (NUMBER)
EVENT_NAME (VARCHAR2(100))
EVENT_TAGLINE (VARCHAR2(100))
EVENT_DESCRIPTION (VARCHAR2(100))
EVENT_HOST (VARCHAR2(100))
EVENT_TYPE (VARCHAR2(100))
EVENT_SUBTYPE (VARCHAR2(100))
EVENT_LOCATION (VARCHAR2(100))
EVENT_CITY_ID (INTEGER)
EVENT_START_TIME (TIMESTAMP)
EVENT_END_TIME (TIMESTAMP)

PARTICIPANTS:
EVENT_ID (NUMBER)
USER_ID (NUMBER)
CONFIRMATION (VARCHAR2(100))

ALBUMS:
ALBUM_ID (VARCHAR2(100))
ALBUM_OWNER_ID (NUMBER)
ALBUM_NAME (VARCHAR2(100))
ALBUM_CREATED_TIME (TIMESTAMP)
ALBUM_MODIFIED_TIME (TIMESTAMP)
ALBUM_LINK (VARCHAR2(2000))
ALBUM_VISIBILITY (VARCHAR2(100))
COVER_PHOTO_ID (VARCHAR2(100))

PHOTOS:
PHOTO_ID (VARCHAR2(100))
ALBUM_ID (VARCHAR2(100))
PHOTO_CAPTION (VARCHAR2(2000))
PHOTO_CREATED_TIME (TIMESTAMP)
PHOTO_MODIFIED_TIME (TIMESTAMP)
PHOTO_LINK (VARCHAR2(2000))

TAGS:
TAG_PHOTO_ID (VARCHAR2(100))
TAG_SUBJECT_ID (NUMBER)
TAG_CREATED_TIME (TIMESTAMP)
TAG_X (NUMBER)
TAG_Y (NUMBER)

Keep the table and field names exactly as written above. Also, make sure you use the correct field types (e.g., NUMBER or INTEGER) as specified above.
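As a hypothetical starting point for createTables.sql, here are the first two tables above with constraints added. The particular constraint choices (NOT NULL on names, foreign keys into USERS, and a CHECK storing each friendship in one order only) are illustrative assumptions, not the required answer; Oracle-style DDL is shown, and SQLite is used below only to exercise the statements, since it accepts the same type names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-only: enforce FK checks
conn.executescript("""
CREATE TABLE USERS (
    USER_ID        NUMBER PRIMARY KEY,
    FIRST_NAME     VARCHAR2(100) NOT NULL,
    LAST_NAME      VARCHAR2(100) NOT NULL,
    YEAR_OF_BIRTH  INTEGER,
    MONTH_OF_BIRTH INTEGER,
    DAY_OF_BIRTH   INTEGER,
    GENDER         VARCHAR2(100)
);
CREATE TABLE FRIENDS (
    USER1_ID NUMBER NOT NULL REFERENCES USERS(USER_ID),
    USER2_ID NUMBER NOT NULL REFERENCES USERS(USER_ID),
    PRIMARY KEY (USER1_ID, USER2_ID),
    CHECK (USER1_ID < USER2_ID)  -- one row per unordered pair of friends
);
INSERT INTO USERS VALUES (1, 'A', 'B', 1990, 1, 1, NULL);
INSERT INTO USERS VALUES (2, 'C', 'D', 1991, 2, 2, NULL);
INSERT INTO FRIENDS VALUES (1, 2);
""")

def rejected(sql):
    """True if the statement violates a declared constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True
```

With this schema, inserting the reversed pair (2, 1) or a pair referencing a nonexistent user is rejected by the database itself, which is exactly the "as close to enforcing the requirements as is practical" goal stated above.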
Failure to do so may result in failing the autograder, since the database is case and type sensitive. (Note: The ID fields would normally be INTEGERs in practice, but they are not in this project for reasons that are not purely technical: the input data sets we are importing contain non-integer values for some keys. Use it as a learning moment for dealing with IDs of different types!) You need to decide which fields will be primary keys and which fields will be foreign keys (if necessary). Use the smallest candidate keys possible for primary keys.

Hints for Part 2
You should capture as many constraints from your ER diagrams as possible in your createTables.sql file. In your dropTables.sql, you should write the DROP TABLE statements necessary to destroy the tables you have created. (Also, for your own learning, it is good to check whether your ER diagrams map to the tables we gave you. If not, you may want to discuss this in the discussions, Piazza, or office hours to determine whether the fault lies in your ER diagrams or in the tables we gave you.)

Using Oracle SQL*Plus, you can run your .sql files with the following commands:
sqlplus <username>/<password> @dropTables.sql
sqlplus <username>/<password> @createTables.sql

Please double-check that you can run the following sequence without errors in a single SQL script. Otherwise, you may fail our auto-grading scripts. Also remember to drop any triggers, constraints, etc., that you created.
● createTables.sql
● dropTables.sql
● createTables.sql
● dropTables.sql

Part 3: Populate Your Database
For this part of the project, you will populate your database with Fakebook data (please see Appendix 1 on where to find the dataset and its description). You should turn in the set of SQL statements (DML) to load data from the public tables (PUBLIC_USER_INFORMATION, etc.) into your tables. You should put all the statements into a file called "loadData.sql".

Hints for Part 3
There will be some variations depending on the schema that you choose.
In most cases, however, you can load data into your tables using very simple SQL commands. Please double-check that you can run the following sequence without errors in a single SQL script. Otherwise, you may fail our auto-grading scripts. Also remember to drop any triggers, constraints, etc., that you created.
● createTables.sql
● loadData.sql
● dropTables.sql

Your loadData.sql must load from our PUBLIC dataset given in the Appendix, not from a private copy. We will be testing your system against hidden datasets and therefore need your loadData.sql to load from the specified dataset. Otherwise, you will fail our tests.

One concern you might have is how to handle the constraint regarding Friend data. For this project, when loading the data, ensure that only the necessary data is loaded. For example, if the original data contains (2,7) and (7,2), load only one of these two rows. Loading both, or neither, would be incorrect. After the data has been loaded, you only need to ensure that any insertion of new data does not break the no-duplication constraint. This can be done either by rejecting any insert or batch insert that would violate the constraint, or by accepting only the valid rows and rejecting the rest. The first option tends to be easier.

Part 4: Create views on your database
As a final task, you will create some views on your tables. Here is what we would like: define views that recreate the same schemas as the PUBLIC tables (see Appendix). The rows in a view do not have to be in exactly the same order as in the corresponding table in the PUBLIC dataset, but the schema must be identical: the columns must have identical names and types. You can check the schema of the PUBLIC tables by using the "DESC TableName" command. Because the original data in the public dataset satisfies all the integrity constraints, each view will have the same set of rows as the corresponding input table.
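Both the friend-pair deduplication above and the view-equivalence check described in the Part 4 hints can be expressed in portable SQL. A hypothetical sketch follows (the course uses Oracle, where the set-difference operator is spelled MINUS; SQLite below uses the standard EXCEPT, and the CASE expressions canonicalize each pair so only one ordering is stored):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PUBLIC_ARE_FRIENDS (USER1_ID NUMBER, USER2_ID NUMBER);
INSERT INTO PUBLIC_ARE_FRIENDS VALUES (2, 7), (7, 2), (1, 3);

CREATE TABLE FRIENDS (
    USER1_ID NUMBER,
    USER2_ID NUMBER,
    PRIMARY KEY (USER1_ID, USER2_ID)
);

-- Load each unordered pair exactly once (smaller id first).
INSERT INTO FRIENDS (USER1_ID, USER2_ID)
SELECT DISTINCT
    CASE WHEN USER1_ID < USER2_ID THEN USER1_ID ELSE USER2_ID END,
    CASE WHEN USER1_ID < USER2_ID THEN USER2_ID ELSE USER1_ID END
FROM PUBLIC_ARE_FRIENDS;

-- Part 4: a view with the same schema as the public table.
CREATE VIEW VIEW_ARE_FRIENDS AS
SELECT USER1_ID, USER2_ID FROM FRIENDS;
""")

# Set-difference check in the style of Hint 2. Here the difference is NOT
# empty, because (7, 2) was deliberately collapsed into (2, 7); this is
# exactly the exception Hint 3 makes for ARE_FRIENDS.
diff = conn.execute("""
SELECT * FROM PUBLIC_ARE_FRIENDS
EXCEPT
SELECT * FROM VIEW_ARE_FRIENDS
""").fetchall()
```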
Name your view tables as follows (the correspondence to the public tables should be obvious -- see the Appendix later):
● VIEW_USER_INFORMATION
● VIEW_ARE_FRIENDS
● VIEW_PHOTO_INFORMATION
● VIEW_TAG_INFORMATION
● VIEW_EVENT_INFORMATION

Turn in the following files that create and drop the views:
● createViews.sql
● dropViews.sql

Hints for Part 4
1. You should check that the following sequence works correctly in a single script (no errors):
● createTables.sql
● loadData.sql
● createViews.sql
● dropViews.sql
● dropTables.sql
2. You should also check, for the provided dataset, that createViews.sql results in tables identical to the provided ones. For example, the following two statements check whether the public dataset is the same as your created view:
○ SELECT * FROM keykholt.PUBLIC_USER_INFORMATION MINUS SELECT * FROM VIEW_USER_INFORMATION;
○ SELECT * FROM VIEW_USER_INFORMATION MINUS SELECT * FROM keykholt.PUBLIC_USER_INFORMATION;
If both queries return no results, then the rows in both are the same.
3. It is not necessary to exactly recreate the PUBLIC_ARE_FRIENDS table, since it is not guaranteed that for every (x,y) row there is a corresponding (y,x) row. For VIEW_ARE_FRIENDS, the requirement is that for every (x,y) entry in the public dataset, the view has either an (x,y) or a (y,x) entry, but not both. For example, if the public dataset has both (2,7) and (7,2), your view should contain only (2,7) or only (7,2).
4. DOUBLE CHECK YOUR SCHEMA! Many students forget to check their schema and fail the autograder because of it.

Submission Checklist
Please put all your files in a single zip file called project1.zip and submit that single file. We will post instructions later.
1. partner.txt: List the partners who worked together. If you worked alone, just list your name. Remember to follow the Engineering Honor Code and course policies in Lecture00.
2. ER diagram. The filename should be er.pdf.
3. Five SQL files:
a. createTables.sql (Part 2)
b. dropTables.sql (Part 2)
c.
loadData.sql (Part 3)
d. createViews.sql (Part 4)
e. dropViews.sql (Part 4)

Both partners should submit separately, even if the submission is identical. The late policy applies individually (even if submissions are identical).

How to create a zip file? Log into a Linux machine and put all your submission files into one folder. Then run:
% zip -r project1.zip partner.txt er.pdf createTables.sql dropTables.sql loadData.sql createViews.sql dropViews.sql
You MUST create the zip file using the above command exactly as typed. That ensures that you include the correct set of files with exactly the right names. You can also add a README.txt file if you wish, for any additional information. To test that your zip file contains everything, email or copy the zip to another machine or folder and unzip it to make sure you are able to extract all the files. We will update the instructions on precisely how to submit your zip file to us by next week.

Appendix: Description of the Fake data set for Part 3
This section describes the format of the fake data we will provide you to load into your database.

Fake social network data
Everyone will have access to a fake data set, which is designed to emulate a social network dataset. The fake data includes the following five tables:
PUBLIC_USER_INFORMATION
PUBLIC_ARE_FRIENDS
PUBLIC_PHOTO_INFORMATION
PUBLIC_TAG_INFORMATION
PUBLIC_EVENT_INFORMATION
These tables are stored in the GSI's account (keykholt). You can access the public tables for the fake data using the GSI's account name (keykholt). For example, to access the PUBLIC_USER_INFORMATION table, you need to refer to it as keykholt.PUBLIC_USER_INFORMATION. You can copy the data into your own account with the following command:
CREATE TABLE NEW_TABLE_NAME AS (SELECT * FROM keykholt.TABLE_NAME);
The data will then be stored in your personal Oracle space. You can log in to SQL*Plus to browse the data.
The fake data tables we provide actually give you some hints on the previous parts of the assignment. However, these tables are highly "denormalized" (poorly designed) and are without any table constraints. As mentioned earlier, the table names are:
PUBLIC_USER_INFORMATION
PUBLIC_ARE_FRIENDS
PUBLIC_PHOTO_INFORMATION
PUBLIC_TAG_INFORMATION
PUBLIC_EVENT_INFORMATION

The fields of those tables are as follows:

PUBLIC_USER_INFORMATION table:
1. USER_ID -- the Fakebook unique ID for users
2. FIRST_NAME -- every user MUST have a first name on file
3. LAST_NAME -- every user MUST have a last name on file
4. YEAR_OF_BIRTH -- some users may not provide this information
5. MONTH_OF_BIRTH -- some users may not provide this information
6. DAY_OF_BIRTH -- some users may not provide this information
7. GENDER -- some users may not provide this information
8. HOMETOWN_CITY -- some users may not provide this information
9. HOMETOWN_STATE -- some users may not provide this information
10. HOMETOWN_COUNTRY -- some users may not provide this information
11. CURRENT_CITY -- some users may not provide this information
12. CURRENT_STATE -- some users may not provide this information
13. CURRENT_COUNTRY -- some users may not provide this information
14. INSTITUTION_NAME -- some users may not provide this information; a single person may have studied at multiple institutions (college and above)
15. PROGRAM_YEAR -- some users may not provide this information; a single person may have enrolled in multiple programs
16. PROGRAM_CONCENTRATION -- some users may not provide this information; this is like a short description of the program
17. PROGRAM_DEGREE -- some users may not provide this information

PUBLIC_ARE_FRIENDS table:
1. USER1_ID
2. USER2_ID
Both USER1_ID and USER2_ID refer to values in the USER_ID field of the USER_INFORMATION table. If two users appear in the same row, it means they are friends; otherwise, they are not friends.
A pair of users should appear only once in the table (i.e., a pair should appear in only one of the two possible orders).

PUBLIC_PHOTO_INFORMATION table: All attributes must be present unless otherwise specified.
1. ALBUM_ID -- the Fakebook unique ID for albums
2. OWNER_ID -- the user ID of the album owner
3. COVER_PHOTO_ID -- each album MUST have one cover photo, and the photo must be in the album; the values are Fakebook unique IDs for photos
4. ALBUM_NAME
5. ALBUM_CREATED_TIME
6. ALBUM_MODIFIED_TIME
7. ALBUM_LINK -- the unique URL directly to the album
8. ALBUM_VISIBILITY -- one of the following values: EVERYONE, FRIENDS_OF_FRIENDS, FRIENDS, MYSELF, CUSTOM
9. PHOTO_ID -- the Fakebook unique ID for photos
10. PHOTO_CAPTION -- an arbitrary string describing the photo; this field is not necessarily populated
11. PHOTO_CREATED_TIME
12. PHOTO_MODIFIED_TIME
13. PHOTO_LINK -- the unique URL directly to the photo

PUBLIC_TAG_INFORMATION table: All attributes must be populated.
1. PHOTO_ID -- unique ID of the corresponding photo
2. TAG_SUBJECT_ID -- unique ID of the corresponding user
3. TAG_CREATED_TIME
4. TAG_X_COORDINATE
5. TAG_Y_COORDINATE

PUBLIC_EVENT_INFORMATION table: All attributes are required unless otherwise specified.
1. EVENT_ID -- the Fakebook unique ID for events
2. EVENT_CREATOR_ID -- unique ID of the user who created this event
3. EVENT_NAME
4. EVENT_TAGLINE -- not necessarily provided
5. EVENT_DESCRIPTION -- not necessarily provided
6. EVENT_HOST
7. EVENT_TYPE -- Fakebook has a fixed set of event types to choose from a drop-down menu
8. EVENT_SUBTYPE -- Fakebook has a fixed set of event subtypes to choose from a drop-down menu
9. EVENT_LOCATION -- a user-entered arbitrary string, for example, "my backyard"; not necessarily provided
10. EVENT_CITY -- not necessarily provided
11. EVENT_STATE -- not necessarily provided
12. EVENT_COUNTRY -- not necessarily provided
13. EVENT_START_TIME
14. EVENT_END_TIME

Oracle and SQL*Plus
This section describes how to get started using Oracle and SQL*Plus.
Logging in to your Oracle Account
First, connect to login.engin.umich.edu using SSH with your UMich account (uniqname and Kerberos password). Then execute:
module load eecs484/f16
NOTE: if you add the "module load" command to your ~/.bash_profile, it will always be executed when you log in to your CAEN account. Then, to connect to the Oracle server, you will just have to enter the sqlplus command. Enter the user name and password for your Oracle account to log in. The default password is eecsclass. When you log in the first time, you will be prompted to change your password. Oracle passwords can contain any alphanumeric characters plus underscore (_), dollar ($), and number sign (#). Do not use quotation marks or the @ symbol in your new password. If you do, and find that you cannot log in, email one of the instructors to reset your password. After that, you can type SQL commands to interact with the database system. Note that you must end every statement you want to execute with a semicolon. To disconnect from Oracle, you can execute:
EXIT
Try this early! If you have trouble accessing your Oracle account, please speak to the GSI.

Overview
While Project 1 focused primarily on database design, in this project you will focus on writing SQL queries. In addition, you will embed your SQL queries into Java source code (using JDBC) to implement "Fakebook Oracle," a standalone program that answers several queries about the Fakebook database. For this project, you will use a standardized schema that we provide to you, rather than the schema that you designed in Project 1. You will also have access to our public Fakebook dataset for testing.

1. Files Provided to You
You are provided with 3 Java files: TestFakebookOracle.java, FakebookOracle.java, and MyFakebookOracle.java. We have also provided a jar file, ojdbc6.jar. Place all 4 of these files, including the jar file, under a folder named project2. In addition, we have provided you a file solution-public.
txt showing sample query results. When submitting your completed project, you only need to turn in MyFakebookOracle.java.

1) TestFakebookOracle.java
This file provides the main function for running the program. You can use it to run and test your program, but you don't need to turn it in. Please modify only the oracleUserName and password static variables, replacing them with your own information.

2) FakebookOracle.java
DO NOT modify this file, although you are welcome to look at its contents if you are curious. This class defines the query functions (discussed below in Section 3) as abstract functions, which you must implement for Project 2. It also defines some useful data structures and provides formatted print functions for you to make use of.

3) MyFakebookOracle.java
This is a subclass of FakebookOracle, in which you must implement the query functions. You should ONLY fill in the body of each of the query functions. DO NOT make any other changes. In this project, you only need to store the results of the queries in our predefined data structures (which we have provided as member variables in the class). You don't need to worry about output formatting. In the base class FakebookOracle.java, a set of print functions has been provided for you to view the query results. The MyFakebookOracle class contains parameterized names for the tables you will need to use in your queries, and they are constructed in the class constructor. You should always use the corresponding variable when referring to a table in a SQL statement executed through JDBC. For example, you should use the variable cityTableName instead of a constant value such as 'PUBLIC_CITIES' in your Java code. For creating and managing the database connection, you should use the predefined object oracleConnection.
4) solution-public.txt
This file contains the output query results from running our official solution implementation on the public dataset provided to you. You can use it to check whether or not your queries generate the same results from the same input dataset. Note that your submission will be graded using a different input dataset, so producing correct results on the public dataset is not a guarantee that your solution is entirely correct. Make sure that your queries are designed to work on any valid input dataset, not just the sample data we provide. Also, think carefully about the semantics of your queries, since it may not always be possible to test them in all scenarios, and you often will not have the benefit of knowing the correct answers in practice.

2. Tables
For this project, your schema will consist of the following tables:
1. <prefix>_USERS
2. <prefix>_FRIENDS
3. <prefix>_CITIES
4. <prefix>_PROGRAMS
5. <prefix>_USER_CURRENT_CITY
6. <prefix>_USER_HOMETOWN_CITY
7. <prefix>_EDUCATION
8. <prefix>_USER_EVENTS
9. <prefix>_PHOTOS
10. <prefix>_ALBUMS
11. <prefix>_TAGS

To Access Public Fakebook Data: <prefix> should be replaced with "PUBLIC" to access the public Fakebook data tables. The public data tables are stored under the GSI's account name (tajik). Therefore, you should prefix the table names with the GSI's account name in order to access the public tables directly within SQL*Plus for testing your queries. For example, to access the public USERS table, you should refer to it as tajik.PUBLIC_USERS. In the Java files provided to you, the above table names are already pre-configured.

3. Queries (100 points)
There are 10 queries in total (Query 0 to Query 9). Query 0 is provided to you as an example, and you are left to implement the remaining nine. The points are as shown. The queries vary tremendously in difficulty. If you get stuck on a harder query, try an easier one first and come back to the tough one later.
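To preview the flavor of these queries, and the advice repeated below about doing the heavy lifting in SQL, here is a hypothetical sketch of the kind of aggregation Query 0 performs. The assignment issues its SQL through JDBC in Java against Oracle; the Python/SQLite version below (with an invented miniature USERS table) only illustrates the query shape, letting the database group, count, and sort rather than doing it in application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE USERS (USER_ID NUMBER, FIRST_NAME VARCHAR2(100),
                    MONTH_OF_BIRTH INTEGER);
INSERT INTO USERS VALUES
    (1, 'Ann', 3), (2, 'Bob', 3), (3, 'Cat', 3), (4, 'Dan', 7);
""")

# One row per month, most populous month first; the database does the
# grouping, counting, and ordering.
rows = conn.execute("""
SELECT MONTH_OF_BIRTH, COUNT(*) AS CNT
FROM USERS
WHERE MONTH_OF_BIRTH IS NOT NULL
GROUP BY MONTH_OF_BIRTH
ORDER BY CNT DESC
""").fetchall()
most_month, fewest_month = rows[0][0], rows[-1][0]
```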
For all of these queries, sample answers on the given data are available in the attached zip file. If the English description is ambiguous, please look at the sample answers. Also, for all of these queries, when feasible, you should try to do most of the heavy lifting of answering the query within SQL. For example, if a query requires you to present the data in sorted order, use ORDER BY in your query rather than retrieving the result and then sorting it within Java. Also, the grading program we use imposes a time limit on how long it waits for a query. If a query appears to be taking too much time, you should consider rewriting it in a different way to make it faster. Nested queries are usually more expensive to run.

Query 0: Find information about month of birth (0 points)
This function has been implemented for you in MyFakebookOracle.java, so that you can use it as an example. The function computes the month in which the most users were born and the month in which the fewest users were born. The names of these users are also retrieved. The sample function uses the Connection object, oracleConnection, to build a Statement object. Using the Statement object, it issues a SQL query and retrieves a ResultSet. It iterates over the ResultSet object and stores the necessary results in a Java object. Finally, it closes both the Statement and the ResultSet objects.

Query 1: Find information about names (10 points)
The next query asks you to find information about users' names, including 1) the longest first name, 2) the shortest first name, and 3) the most common first name. If there are ties, you should include all of the matches in your result. The provided code snippet illustrates the data structures that should be constructed. However, it is up to you to add your own JDBC query to answer the question correctly.

Query 2: Find "lonely" users (10 points)
The next query asks you to find information about all users who have no friends in the network.
Again, you will place your results into the provided data structures. The sample code in MyFakebookOracle.java illustrates how to do this.

Query 3: Find "world travelers" (10 points)
The next query asks you to find information about all users who no longer live in their hometowns. In other words, the current_city associated with these users should NOT be the same as their hometown_city (and neither should be null). You will place your results into the provided data structures.

Query 4: Find information about photo tags (10 points)
For this query, you should find the top n photos that have the most tagged users. You will also need to retrieve information about each of the tagged users. If there are ties (i.e., photos with the same number of tagged users), then choose the photo with the smaller ID first. This will be string lexicographic ordering, since the data types are VARCHARs (for instance, "10" will be less than "2").

Query 5: Find users to set up on dates (15 points)
For this task, you should find the top n "match pairs" according to the following criteria:
(1) One of the users is female, and the other is male (note: they do not have to be friends of the same person)
(2) Their age difference is

Remember your new password. Resetting it is possible but will incur a delay (possibly even 24-48 hours), as it would be a manual process. If you need to reset it, you should email the teaching staff and allow us 24-48 hours to respond. CAEN and ITD do not support this database system.

Initializing the Database
This project comes with a zip file (dvd_store_data.zip) that contains the relevant commands to initialize the database. Download and unzip the file, and you will find a file called setup_db.sql. You will use this file to populate the database.
Specifically, you can do the following steps on CAEN to initialize the database:

% unzip dvd_store_data.zip
% cd dvd_store_data
% psql -h eecs484.eecs.umich.edu -U uniqname
Password for user uniqname:

You are now connected to the Postgres interactive terminal. Run the following command (note the backslash) to execute the commands from setup_db.sql:

\i setup_db.sql

It can take a few minutes for the database to be initialized. Here is what you may see when initializing (or re-initializing). The error messages about permission being denied on triggers can be ignored.

uniqname=> \i setup_db.sql
psql:pgsqlds2_delete_all.sql:2: ERROR: permission denied: "RI_ConstraintTrigger_19357" is a system trigger
psql:pgsqlds2_delete_all.sql:3: ERROR: permission denied: "RI_ConstraintTrigger_19359" is a system trigger
psql:pgsqlds2_delete_all.sql:4: ERROR: permission denied: "RI_ConstraintTrigger_19409" is a system trigger
psql:pgsqlds2_delete_all.sql:5: ERROR: permission denied: "RI_ConstraintTrigger_19391" is a system trigger
psql:pgsqlds2_delete_all.sql:6: ERROR: permission denied: "RI_ConstraintTrigger_19424" is a system trigger
psql:pgsqlds2_delete_all.sql:7: ERROR: permission denied: "RI_ConstraintTrigger_19383" is a system trigger
DROP TABLE
DROP TABLE
psql:pgsqlds2_create_tbl2.sql:30: NOTICE: CREATE TABLE will create implicit sequence "customers_customerid_seq" for serial column "customers.customerid"
psql:pgsqlds2_create_tbl2.sql:30: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "customers_pkey" for table "customers"
CREATE TABLE
psql:pgsqlds2_create_tbl2.sql:43: NOTICE: CREATE TABLE will create implicit sequence "orders_orderid_seq" for serial column "orders.orderid"
psql:pgsqlds2_create_tbl2.sql:43: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "orders_pkey" for table "orders"
CREATE TABLE
psql:pgsqlds2_create_tbl2.sql:51: NOTICE: CREATE TABLE will create implicit sequence "categories_category_seq" for serial column "categories.category"
psql:pgsqlds2_create_tbl2.sql:51: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "categories_pkey" for table "categories"
CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
psql:pgsqlds2_create_tbl2.sql:82: NOTICE: CREATE TABLE will create implicit sequence "products_prod_id_seq" for serial column "products.prod_id"
psql:pgsqlds2_create_tbl2.sql:82: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "products_pkey" for table "products"
CREATE TABLE
CREATE TABLE
CREATE TABLE
psql:pgsqlds2_create_tbl2.sql:115: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "inventory_pkey" for table "inventory"
CREATE TABLE
CREATE TABLE
psql:pgsqlds2_load_orderlines.sql:1: ERROR: permission denied: "RI_ConstraintTrigger_20643" is a system trigger
psql:pgsqlds2_load_orderlines.sql:16: ERROR: permission denied: "RI_ConstraintTrigger_20643" is a system trigger

Helpful Links
Now you should proceed to do the quiz at the Google Forms link given at the beginning. As you do the quiz, you may have to refer to the Postgres documentation and run Postgres commands to find the answers to the quiz questions. Some of the tables that you will be using during the course are system catalog tables. In most databases, there are tables about the tables that you create. For example, one of the tables in Postgres is pg_class, which has general information about all the relations, indexes, etc., including their column names. Another table is pg_stats, which contains the approximate number of tuples in each table, the number of distinct values in each column, etc. These two tables are very useful in query optimization. The data in these tables helps the query optimizer estimate the cost of different ways of evaluating a query (e.g., whether to use an index or to ignore it).
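The planner's index-versus-scan decision can be observed directly. The assignment uses Postgres's EXPLAIN command (see the links that follow); as a self-contained illustration, SQLite's analogous EXPLAIN QUERY PLAN shows the same decision changing once an index exists (table and index names here are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("s%d" % i, 17 + i % 10) for i in range(1000)])

def plan(sql):
    """Return the planner's chosen strategy as a single string."""
    return " ".join(row[-1] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM students WHERE age = 18")  # full table scan
conn.execute("CREATE INDEX idx_age ON students (age)")
after = plan("SELECT * FROM students WHERE age = 18")   # index lookup
```

In Postgres the equivalent check is simply `EXPLAIN SELECT * FROM students WHERE age = 18;` before and after `CREATE INDEX`, and pg_stats is where the planner gets the row counts and distinct-value estimates that drive the choice.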
For example, ignoring an index and just doing a regular file scan may be more efficient in some cases (e.g., see the Lecture Notes, where examples of SELECT queries on age = 18 for UM students were discussed).

The following are links to the Postgres documentation relevant to this assignment. Be prepared to look up the documentation as you work on the exercises.
● Full PostgreSQL 8.4.16 documentation: https://www.postgresql.org/docs/8.4/static/index.html
● System catalogs that give you information about tables: https://www.postgresql.org/docs/8.4/static/catalogs.html
● Statistics used by the query planner: https://www.postgresql.org/docs/8.4/static/planner-stats.html
● How to manipulate the query planner (such as disabling the use of a join algorithm): https://www.postgresql.org/docs/8.4/static/runtime-config-query.html
● Syntax of the EXPLAIN command: https://www.postgresql.org/docs/8.4/static/sql-explain.html
● How to use the EXPLAIN command and interpret its output: https://www.postgresql.org/docs/8.4/static/using-explain.html
● Creating an index: https://www.postgresql.org/docs/8.4/static/sql-createindex.html
● Creating a clustered index: https://www.postgresql.org/docs/8.4/interactive/sql-cluster.html

MongoDB (80 points)

Introduction

In this project, you will learn how to transfer data from SQL to MongoDB and how to perform MongoDB queries. There are two parts to this project. In Part 1, you will write a Java program to export a small portion of the Fakebook data stored in the Project 2 tables into one JSON file, which serves as the input to Part 2. In Part 2, you will import this JSON file into MongoDB and perform a couple of queries in the form of JavaScript functions. (Note: we have provided the JSON file that should result from your Part 1, to allow you to work on Part 1 and Part 2 in either order, and to help you check the correctness of Part 1. See more on this later in this document.)
JSON objects and arrays

MongoDB uses a format called JSON (JavaScript Object Notation) extensively. JSON is a key-value representation, in which values can themselves be JSON objects. Here are some examples of objects in JSON notation:

Example 1 (JSON object): {"firstName": "John", "lastName": "Doe"}

Here, "firstName" is a key, and the corresponding value is "John". Similarly, "lastName" is a key and "Doe" is its value. Think of it like a map. Here is some JavaScript code that uses the above:

var employee = {"firstName": "John", "lastName": "Doe"};
employee["firstName"]; // displays John

One can also have a JSON array, which is an array of JSON objects. For example, the variable employees below is an array of JSON objects:

var employees = [
  {"firstName": "John", "lastName": "Doe"},
  {"firstName": "Atul", "lastName": "Prakash"},
  {"firstName": "Barzan", "lastName": "Mozafari"}
];
employees[1]["firstName"]; // prints Atul

Nesting is possible. For example, in a key-value pair, the value can be a JSON array, another JSON object, or just a simple string or number.

MongoDB does not use SQL, but there are some similarities. Here are a few simple examples; see the following for more examples: https://docs.mongodb.org/manual/reference/sql-comparison/

SQL: Table
MongoDB: Collection. Initialized using a JSON array.

SQL: Tuple or row of a table
MongoDB: Document. Corresponds to a JSON object.

SQL: SELECT * FROM users;
MongoDB: db.users.find();

SQL: SELECT * FROM users WHERE name = 'John' AND age = 50;
MongoDB: db.users.find({name: "John", age: 50});

SQL: SELECT user_id, addr FROM users WHERE name = 'John';
MongoDB: db.users.find({name: "John"}, {user_id: 1, addr: 1, _id: 0});

Install mongodb on your local machine:

We encourage you to install mongodb locally on your laptop to have a more pleasant development environment and also to explore mongodb's functionality by yourself. You can follow the instructions in the following links. Post on Piazza if you get stuck, and feel free to help other students get mongo installed.
A great way is to reply to any questions related to installation on Piazza, based on your experience. A Google search on installation error messages can also help.

Install mongodb on Mac: https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-os-x/ (Note: you may need to use the sudo command at times to temporarily become root if you get Permission Denied errors.)

Install mongodb on Windows: https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-windows/

Install mongodb on Linux: https://docs.mongodb.com/v3.2/administration/install-on-linux/ (Note: you may need to use the sudo command at times to temporarily become root if you get Permission Denied errors.)

When you have it successfully installed, the following commands should work from a Terminal:

% mongo
% mongoimport

If you have trouble installing locally, you can also use the mongo server on eecs484.eecs.umich.edu directly by supplying a userid and password. See Part 2 below.

Part 1 – Export SQL data to JSON using JDBC

This part does not really make use of MongoDB. It instead relies on your knowledge from Project 2 (SQL!). You will be retrieving data from the public tables of Project 2 and outputting a subset of the information as a JSON array. We have also given you the output file that we are expecting, so that you can check your answer.

You are provided with 3 files for this part: GetData.java, Main.java, and a Makefile. You will write code inside GetData.java, resulting in a file output.json. You are also provided sample.json, which is one of the possible correct outputs. Your output.json should contain a JSON array equivalent to the one in sample.json.

1) Main.java

This file contains the main function to run the program. The only modification you need to make to this file is to provide the SQL*Plus username and password that you used in Project 2. Please refer to your Project 2 files in case you forgot what it is, since it is probably embedded in one of the files there.
public class Main {
    static String dataType = "PUBLIC";
    static String oracleUserName = "username"; // replace with your Oracle account name
    static String password = "password";       // replace with your Oracle password
    ...

2) GetData.java

This file contains the function you need to write to export data from SQL*Plus to a JSON file. The function will output a JSON file called "output.json", which you will need to submit.

public JSONArray toJSON() throws SQLException {
    // Your implementation goes here....
    // This is an example usage of JSONArray and JSONObject.
    // The array contains a list of objects.
    // All user information should be stored in the JSONArray object: users_info
    // You will need to DELETE this stuff. This is just an example.

    // A JSONObject is an unordered collection of name/value pairs. Add a few name/value pairs.
    JSONObject test = new JSONObject();  // declare a new JSONObject
    // A JSONArray consists of multiple JSONObjects.
    JSONArray users_info = new JSONArray();
    test.put("user_id", "testid");       // populate the JSONObject
    test.put("first_name", "testname");
    JSONObject test2 = new JSONObject();
    test2.put("user_id", "test2id");
    test2.put("first_name", "test2name");
    users_info.add(test);                // add the JSONObject to the JSONArray
    users_info.add(test2);               // add the JSONObject to the JSONArray
    return users_info;
}

You need to use JDBC to query the relevant Project 2 tables to get information about users and store the information in a JSONArray called users_info. It is OK to use multiple queries — in fact, that may be more convenient. The users_info object is a JSONArray and contains an array of JSONObjects. Each JSONObject should contain information about one user. For each user (stored as a JSONObject), you will need to retrieve the following information:
● user_id
● first_name
● last_name
● gender
● YOB
● MOB
● DOB
● hometown (JSONObject). Hometown contains city, state, and country.
● friends (JSONArray).
Friends contains an array of user_ids which are greater than the user_id of that user. Following is an example of one user JSONObject (the order of key/value pairs in a JSONObject doesn't matter); see the file sample.json for an example of valid output.

3) Makefile

We provide a simple Makefile to compile the Java files and run them. You may make changes to this file if necessary.

To compile the code, do:
$ make

To run the code, do:
$ make run

An output file output.json should result.

4) output.json

Since the order of attributes inside a JSONObject is not fixed, there are many correct answers for output.json. However, when you import them into the database, they should all behave identically for queries. For your convenience, we have provided sample.json, which is one of the correct answers. To test whether your output.json is correct, you can do Part 2 with your output.json as the input instead of sample.json. If you get the same answers, that is a good sign (though not a proof). If you get different answers, then something is definitely wrong with your output.json.

Part 2 – Query MongoDB

The first step is to import the output.json file from Part 1 into MongoDB as a collection. Alternatively, you can use sample.json as your input file. If you are working on a CAEN machine, use the following commands to import sample.json into the database, for example:

$ module load mongodb
$ mongoimport --host eecs484.eecs.umich.edu --username <uniqname> --password <password> --collection users --db <uniqname> --file sample.json --jsonArray

You can also do this by modifying the Makefile and then doing:
$ make setupsampledb

Alternatively, to use your output.json, you can do:
$ make setupmydb

On the eecs484 server, we have set up a Mongo database for each student. The database name is your uniqname, and the username is also your uniqname. The password is eecs484class for all students; you can change the password for your database (see https://docs.mongodb.org/manual/tutorial/manage-users-and-roles/).
What you need to do is:

db.updateUser(<username>, {pwd: "newpassword"})

You can do this either in the mongo shell or run it as a script. (If you are using a private, local copy of mongodb on your personal machine, you can instead just omit the --host, --username, and --password flags.)

When importing a JSON file, please use "users" as your collection name, and do not modify this collection. The command above does that. You can create additional collections besides users, if you want, as helper collections to answer the queries.

For the second step of this part, you will need to write 7 queries in the form of JavaScript functions to query MongoDB. MongoDB can load JavaScript files. You can find more information via the following link:
https://docs.mongodb.org/manual/tutorial/write-scripts-for-the-mongo-shell/

If a collection is created in a query, you may reuse that collection in subsequent queries to save time.

Note: Since only hometown information is retrieved in Part 1, we assume that "city" in the queries means the hometown city.

Query 1: Find users who live in a certain city

This query should return a JavaScript array of the user_ids of users who live in the specified city. The city is passed as a parameter of the function find_user.

Hint: A possible answer would start out like:

var result = db.users.find(....); // Read MongoDB tutorials on the find command.

In addition, you may find the following useful:
https://docs.mongodb.org/v3.0/reference/method/cursor.forEach/

Instead of using forEach, you can also iterate over the cursor returned by find, and push the user_id from each result into a JavaScript array variable.

function find_user(city, dbname){
    db = db.getSiblingDB(dbname)
    // implementation goes here
    // Returns a JavaScript array. See test.js for a partial correctness check.
    // This will be an array of integers. The order does not matter.
}

Note: Queries 2-5 assume that the variable db has been initialized from Query 1 above. Do not drop the db database.
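Before writing the mongo shell version of find_user, it can help to prototype the query's logic on plain data with the same document shape that the Part 1 export produces. The sketch below does this in Python on a handful of invented sample documents (the names and ids are made up for illustration; they are not from the real dataset):

```python
# A plain-Python prototype of Query 1's logic, for checking expected
# output before writing the mongo shell version. The documents below are
# invented; only their shape mirrors the Part 1 export.
users = [
    {"user_id": 10, "hometown": {"city": "Ann Arbor", "state": "MI", "country": "USA"}},
    {"user_id": 20, "hometown": {"city": "Detroit", "state": "MI", "country": "USA"}},
    {"user_id": 30, "hometown": {"city": "Ann Arbor", "state": "MI", "country": "USA"}},
]

def find_user(city, docs):
    # Analogue of db.users.find({"hometown.city": city}) followed by
    # pushing each matching user_id into an array.
    return [u["user_id"] for u in docs if u["hometown"]["city"] == city]

print(find_user("Ann Arbor", users))  # [10, 30]
```

The mongo shell version follows the same shape: filter on the nested hometown.city field, then collect user_id values into an array.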
Query 2: Unwind friends

Each document in the collection represents one user's information, including a list of friend ids. In this query, you need to unwind the friends list so that the resulting collection contains documents, each of which represents one friend pair. You don't need to return anything. The new collection must be named flat_users.

You may find this link on MongoDB's unwind helpful:
https://docs.mongodb.org/manual/reference/operator/aggregation/unwind/

function unwind_friends(dbname){
    db = db.getSiblingDB(dbname)
    // implementation goes here
    // Returns nothing. It creates a collection instead, as specified above.
}

You may also find the following useful:
https://docs.mongodb.org/manual/reference/operator/aggregation/

In particular, besides $unwind, $project and $out can also be useful. $out can create a collection directly. Instead of $out, you can also iterate over the resulting cursor from the query and use the insert operator to insert the tuples into flat_users. See the documentation link in the query below as well.

Query 3: Create a city to user_id mapping

Create a new collection. Documents in the collection should contain two fields: the _id field holds the city name, and the users field holds an array of the user_ids of users who live in that city. You don't need to return anything. The new collection must be named cities.

You may find the following link helpful:
https://docs.mongodb.org/manual/reference/operator/aggregation/out/

function cities_table(dbname) {
    db = db.getSiblingDB(dbname)
    // implementation goes here
    // Returns nothing. Instead, it creates a collection inside the database.
}

Query 4: Recommend friends

Find user pairs such that one is male, the other is female, their year-of-birth difference is less than year_diff, they live in the same city, and they are not friends with each other. Store each pair as an array (male first, female second), store all pairs as an array of arrays, and return that array at the end of the function.
function suggest_friends(year_diff, dbname) {
    db = db.getSiblingDB(dbname)
    // implementation goes here
    // Return an array of arrays.
}

Query 5: Find the oldest friend

Find the oldest friend for each user who has friends. For simplicity, use only the year of birth to determine age. If there is a tie, use the friend with the smallest user_id. Notice that in the collection, each user only has the information of friends whose user_id is greater than that user's, due to the requirement in the Fakebook database. You therefore also need to find the friends who have smaller user_ids than the user. You should find the ideas of Queries 2 and 3 useful.

Return a JavaScript object whose keys are user_ids and whose value for each key is the oldest friend's id. The number of keys in the object should equal the number of users who have friends.

function oldest_friend(dbname){
    db = db.getSiblingDB(dbname)
    // implementation goes here
    // Return a JavaScript object as described above.
}

Query 6: Find the average friend count for users

We define the "friend count" as the number of friends of a user. The "average friend count" is the average friend count over a collection of users. In this function, we ask you to find the average friend count for the users collection.

function find_avg_friend_count(dbname){
    db = db.getSiblingDB(dbname)
    // implementation goes here
    // Return a decimal as described below.
}

Return a decimal that is the average friend count over all users in the "users" collection.

Query 7: Find the city average friend count using MapReduce

MapReduce is a very powerful yet simple parallel data processing paradigm. Please refer to Discussion 8 (to be released on 11/30) for more information. In this question, we are asking you to use MapReduce to find the "average friend count" at the city level (i.e., the average friend count per user, where the users belong to the same city).
We have set up the MapReduce calling point in test.js, and we are asking you to write the mapper, reducer, and finalizer (if needed) to find the average friend count per user for each city. You may find the following link helpful:
https://docs.mongodb.com/v3.2/core/map-reduce/

var city_average_friendcount_mapper = function() {
    // implement the map function of average friend count
};

var city_average_friendcount_reducer = function(key, values) {
    // implement the reduce function of average friend count
};

var city_average_friendcount_finalizer = function(key, reduceVal) {
    // We've implemented a simple forwarding finalize function. This implementation is
    // naive: it just forwards the reduceVal to the output collection.
    // Feel free to change it if needed. However, please keep this unchanged: the var ret
    // should be the average friend count per user of each city.
    var ret = reduceVal;
    return ret;
};

Sanity check: after running test.js, running db.friend_city_population.find() in the mongo shell should show that the collection friend_city_population contains records of the following form:

{ "_id" : "some_city", "value" : 15.23 }

where _id is the name of the city, and value is the average friend count per user.

Sample test

We offer a test.js which will call all 7 query JavaScript files and will print "Query x is correct!" if your query passes the simplistic test in test.js. In test.js, you need to put your database name in the dbname variable. Please make sure that all 7 query JavaScript files are in the same directory as test.js. Also, please note that the autograder will use a program similar to test.js but more exhaustive; in particular, we compare the output of your queries against a reference output in more depth. You are free to make changes to test.js for more exhaustive testing.

To run the tests from the command line, you can do:

$ module load mongodb   # This is only required once per login.
$ mongo -u <uniqname> -p <password> --host eecs484.eecs.umich.edu < test.js

This will connect to your database (named after your uniqname) on the eecs484 mongodb server and run the script test.js. Alternatively, you can use the Makefile:

$ make mongoquerytest

will basically do the above for you. If you want to open the mongodb shell, do:

$ mongo -u <uniqname> -p <password> --host eecs484.eecs.umich.edu

Again, if you are using a local mongodb on your personal computer, just do the following instead:

$ mongo dbname < test.js

Note: it may take some time to run Query 4 and Query 5. However, Query 4 should take less than 3 minutes and Query 5 should take less than 6 minutes. You will receive 0 on a query if it exceeds this time limit.

What to submit

You should submit a zip file named p4.zip. The zip file should include:
GetData.java — your Java program to create the JSON file. Do not change the code for writing to output.json.
query1.js — returns a JavaScript array of user_ids.
query2.js — creates a new collection called flat_users; nothing to return.
query3.js — creates a new collection called cities; nothing to return.
query4.js — returns a JavaScript array of user pairs.
query5.js — returns a JavaScript object; keys are user_ids, and the value for each key is the oldest friend id for that user_id.
query6.js — returns a decimal which is the "average friend count" for all users.
query7.js — finish the mapper, reducer, and finalizer.

Where to submit

For the MongoDB part of the project, we have made an autograder available at grader484.eecs.umich.edu. You will submit the MongoDB part of the project online. (The Postgres portion is to be submitted via the Google Form at the link given earlier in the specs.)
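Although the queries themselves must be written as mongo shell JavaScript, the transformations asked for in Queries 2, 3, and 6 can be sanity-checked offline on plain Python data whose shape mirrors the Part 1 export. The sample documents below are invented for illustration only:

```python
# Offline sanity checks for the logic of Queries 2, 3, and 6, on invented
# sample documents shaped like the Part 1 export.
users = [
    {"user_id": 1, "hometown": {"city": "Ann Arbor"}, "friends": [2, 3]},
    {"user_id": 2, "hometown": {"city": "Detroit"}, "friends": [3]},
    {"user_id": 3, "hometown": {"city": "Ann Arbor"}, "friends": []},
]

# Query 2 analogue: unwind the friends array into one document per friend
# pair (what $unwind does in the aggregation pipeline).
flat_users = [{"user_id": u["user_id"], "friends": f}
              for u in users for f in u["friends"]]

# Query 3 analogue: group user_ids by hometown city (what a $group stage,
# followed by $out, would produce).
cities = {}
for u in users:
    cities.setdefault(u["hometown"]["city"], []).append(u["user_id"])

# Query 6 analogue: average friend count over the users collection.
avg_friend_count = sum(len(u["friends"]) for u in users) / len(users)

print(flat_users)
print(cities)            # {'Ann Arbor': [1, 3], 'Detroit': [2]}
print(avg_friend_count)  # 1.0
```

Working out the expected results by hand on a tiny collection like this makes it much easier to spot a wrong pipeline stage when the mongo version runs against the real data.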
1. (10 points) Let $\{x_1, x_2, \ldots, x_n\}$ be a set of points in d-dimensional space. Suppose we wish to produce a single point estimate $\mu \in \mathbb{R}^d$ that minimizes the squared error:
$$\|x_1 - \mu\|_2^2 + \|x_2 - \mu\|_2^2 + \ldots + \|x_n - \mu\|_2^2$$
Find a closed-form expression for $\mu$ and prove that your answer is correct.

2. (10 points) Not all norms behave the same; for instance, the $\ell_1$-norm of a vector can be dramatically different from the $\ell_2$-norm, especially in high dimensions. Prove the following norm inequalities for d-dimensional vectors, starting from the definitions provided in class and the lecture notes. (Use any algebraic technique/result you like, as long as you cite it.)
a. $\|x\|_2 \le \|x\|_1 \le \sqrt{d}\,\|x\|_2$
b. $\|x\|_\infty \le \|x\|_2 \le \sqrt{d}\,\|x\|_\infty$
c. $\|x\|_\infty \le \|x\|_1 \le d\,\|x\|_\infty$

3. (10 points) When we think of a Gaussian distribution (a bell curve) in 1, 2, or 3 dimensions, the picture that comes to mind is a "blob" with a lot of mass near the origin and exponential decay away from the origin. However, the picture is very different in higher dimensions (and illustrates the counter-intuitive nature of high-dimensional data analysis). In short, we will show that Gaussian distributions are like soap bubbles: most of the mass is concentrated near a shell of a given radius, and it is empty everywhere else.
a. Fix d = 3 and generate 10,000 random samples from the standard multivariate Gaussian distribution defined on $\mathbb{R}^d$.
b. Compute and plot the histogram of the Euclidean norms of your samples. Also calculate the average and standard deviation of the norms.
c. Increase d on a coarsely spaced log scale all the way up to d = 1000 (say d = 50, 100, 200, 500, 1000), and repeat parts (a) and (b). Plot the variation of the average and the standard deviation of the Euclidean norm of the samples with increasing d.
d. What can you conclude from your plot in part (c)?
e. Bonus, not for grade. Mathematically justify your conclusion using a formal proof. You are free to use any familiar laws of probability, algebra, or geometry.

4.
(20 points) The goal of this problem is to implement a very simple text retrieval system. Given (as input) a database of documents as well as a query document (all provided in an attached .zip file), write a program, in a language of your choice, to find the document in the database that is the best match to the query. Specifically:
a. Write a small parser to read each document and convert it into a vector of words.
b. Compute tf-idf values for each word in every document as well as in the query.
c. Compute the cosine similarity between the tf-idf vectors of each document and the query.
d. Report the document with the maximum similarity value.

5. (optional) How much time (in hours) did you spend working on this assignment?

1. (10 points) In class we derived the optimal linear predictor for scalar data, and wrote down the optimal linear predictor for vector data (without proof). Here, let us derive a proof for the optimal linear predictor in the vector case. Suppose that $\{x_1, x_2, \ldots, x_n\}$ denote training data samples, where each $x_i \in \mathbb{R}^d$. Suppose $\{y_1, y_2, \ldots, y_n\}$ denote the corresponding (scalar) labels.
a. Show that the mean squared-error loss function for multivariate linear regression can be written in the following form:
$$\mathrm{MSE}(w) = \|y - Xw\|_2^2$$
where X is an $n \times (d+1)$ matrix whose first column is all ones. What are the dimensions of w and y? What do the coordinates of w represent?
b. Theoretically prove that the optimal linear regression weights are given by:
$$\hat{w} = (X^T X)^{-1} X^T y.$$
What algebraic assumptions on X did you make in order to derive the closed form?

2. (10 points) In class, we argued that convexity together with smoothness is a sufficient condition for minimizing a loss function using gradient descent. Is convexity also a necessary condition for gradient descent to successfully train a model? Argue why or why not. You can intuitively explain your answer in words and/or draw a figure in support of your explanation.
(Hint: What about $f(x) = x^2 + 3\sin^2 x$?)

3. (20 points) The goal of this problem is to implement multivariate linear regression from scratch using gradient descent and to validate it.
a. Implement a function for learning the parameters of a linear model on given training data with a user-specified learning rate η and number of epochs T. Note: you cannot use existing libraries such as sklearn; you need to write it out yourself.
b. Validate your algorithm on the glucose dataset discussed in Lecture 2. Confirm that the model obtained by running your code gets similar $R^2$ values as the one produced by sklearn.

4. (10 points) In this lab, we will illustrate the use of multiple linear regression for calibrating robot control. The robot data for the lab is taken from TU Dortmund's Multiple Link Robot Arms Project. We will focus on predicting the current drawn into one of the joints as a function of the robot motion. Such models are essential in predicting the overall robot power consumption.
a. Read in the data in the attached exp_train.csv file; check that the data that you read actually corresponds to the data in the .csv file. In Python, you can use the commands given at the end of this document.
b. Create the training data:
• Labels y: a vector of all the samples in the 'I2' column
• Data X: a matrix of the data with the columns ['q2', 'dq2', 'eps21', 'eps22', 'eps31', 'eps32', 'ddq2']
c. Fit a linear model between X and y (using sklearn, or any other library of your choice). Report the MSE of your model.
d. Using the linear model that you trained above, report the MSE on the test data contained in the attached exp_test.csv file.

5. (optional) How much time (in hours) did you spend working on this assignment?
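Problem 3a above asks for gradient descent linear regression written from scratch. As a starting sketch only, here is a minimal, dependency-free one-dimensional version; the toy data, learning rate, and epoch count are arbitrary choices, and the assignment itself requires the multivariate case:

```python
# A minimal from-scratch sketch of gradient descent for linear regression,
# kept one-dimensional and dependency-free for clarity.
def fit_linear(xs, ys, eta=0.01, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2)
        gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= eta * gw
        b -= eta * gb
    return w, b

# Toy data generated exactly from y = 2x + 1, so GD should recover w ≈ 2, b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # 2.0 1.0
```

The multivariate version replaces the two scalar gradients with a gradient vector over all coefficients, but the update loop has the same shape.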
import pandas as pd
names = [
    't',                                 # Time (secs)
    'q1', 'q2', 'q3',                    # Joint angle
    'dq1', 'dq2', 'dq3',                 # Joint velocity
    'I1', 'I2', 'I3',                    # Motor current (A)
    'eps21', 'eps22', 'eps31', 'eps32',  # Strain measurements
    'ddq1', 'ddq2', 'ddq3'               # Joint accelerations
]
df = pd.read_csv('exp_train.csv', header=None, sep=',', names=names, index_col=0)
df.head(6)

1. (10 points) Suppose we wish to learn a regularized least-squares model:
$$L(w) = \frac{1}{2} \sum_{i=1}^{n} (y_i - \langle w, x_i \rangle)^2 + \lambda R(w)$$
where R(w) is a regularization function to be determined. Suggest good choices for R(w) if the following criteria need to be achieved (there are no unique correct answers) and justify your choice in a sentence or two:
a. All parameters w are free to be determined.
b. w should be sparse (i.e., only a few coefficients of w are nonzero).
c. The coefficients of w should be small in magnitude on average.
d. For most indices j, $w_j$ should be equal to $w_{j-1}$.
e. w should have no negative-valued coefficients.

2. (10 points) The Boston Housing Dataset was collected by the US Census Service and consists of 14 urban quality-of-life variables, with the last one being the median house price for a given town. Code for loading the dataset is provided at the end of this assignment. Implement a linear regression model with ridge regression that predicts median house prices from the other variables. Use 10-fold cross-validation on 80-20 train-test splits and report the final $R^2$ values that you discovered. (You may want to preprocess your data to the range [0, 1] in order to get meaningful results.)

3. (10 points) In class, we discussed the lasso objective, where the regularizer was chosen to be the $\ell_1$-norm. Here, we will derive an analytical closed-form expression for the minimizer of a slightly simpler problem. Suppose x is a d-dimensional input and w is a d-dimensional variable.
Show that the minimizer of the loss function:
$$L(w) = \frac{1}{2} \|x - w\|_2^2 + \lambda \|w\|_1$$
is given by:
$$w_i^* = \begin{cases} x_i - \lambda & \text{if } x_i > \lambda, \\ x_i + \lambda & \text{if } x_i < -\lambda, \\ 0 & \text{otherwise.} \end{cases}$$

4. (20 points) In this problem, we will implement logistic regression trained with GD/SGD and validate it on synthetic training data.
a. Suppose that the data dimension d equals 2. Generate two clusters of data points with 100 points each (so that the total data size is n = 200), by sampling from Gaussian distributions centered at (0.5, 0.5) and (−0.5, −0.5). Call the data points $x_i$, and label them as $y_i = \pm 1$ depending on which cluster they originated from. Choose the variance of the Gaussians to be small enough so that the data points are sufficiently well separated. Plot the data points on the 2D plane to confirm that this is the case.
b. (Derive your own GD routines; do not use sklearn functions here.) Train a logistic regression model that tries to minimize:
$$L(w) = -\sum_{i=1}^{n} \left[ y_i \log \frac{1}{1 + e^{-\langle w, x_i \rangle}} + (1 - y_i) \log \frac{e^{-\langle w, x_i \rangle}}{1 + e^{-\langle w, x_i \rangle}} \right]$$
using Gradient Descent (GD). Plot the decay of the training loss as a function of the number of iterations.
c. Train the same logistic regression model, but this time using Stochastic Gradient Descent (SGD). Demonstrate that SGD exhibits a slower rate of convergence than GD, but is faster per iteration, and does not suffer in terms of final quality. You may have to play around a bit with the step-size parameters as well as mini-batch sizes to get reasonable answers.
d. Overlay the original plot of data points on the 2D data plane with the two (final) models that you obtained in parts b and c to visualize the correctness of your implementation.

5. (optional) How much time (in hours) did you spend working on this assignment?

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.head()

1.
(10 points) In class, we discussed how to represent XOR-like functions using quadratic features, since standard linear classifiers (such as perceptrons) are insufficient for this task. However, here we show that XOR-like functions can indeed be simulated using multi-layer networks of perceptrons. This example shows a glimpse of the expressive power of "deep neural networks": merely increasing the depth from 1 to 2 layers can help reproduce nonlinear decision boundaries.
a. Consider a standard two-variable XOR function, where we have 2-dimensional inputs $x_1, x_2 = \pm 1$, and output
$$y = x_1 \,\mathrm{XOR}\, x_2 = \begin{cases} -1 & \text{if } x_1 = x_2, \\ 1 & \text{otherwise.} \end{cases}$$
Geometrically argue why a single perceptron cannot be used to simulate the above function.
b. Graphically depict, and write down the equation for, the optimal decision region for the following logical functions:
(i) $x_1$ AND (NOT $x_2$)
(ii) (NOT $x_1$) AND $x_2$
(iii) $x_1$ OR $x_2$
Make note of the weights learned corresponding to the optimal decision boundary for each function.
c. Using the above information, simulate a multi-layer perceptron network for the XOR operation with the learned weights from part (b).

2. (10 points) In this problem, we will implement the Perceptron algorithm discussed in class on synthetic training data.
a. Suppose that the data dimension d equals 2. Generate two clusters of data points with 100 points each, by sampling from Gaussian distributions centered at (0.5, 0.5) and (−0.5, −0.5). Choose the variance of the Gaussians to be small enough so that the data points are sufficiently well separated. Plot the data points on the 2D plane to confirm that this is the case.
b. Implement the Perceptron algorithm as discussed in class. Choose the initial weights to be zero, the maximum number of epochs as T = 100, and the learning rate α = 1. How quickly does your implementation converge?
c.
Now, repeat the above experiment with a second synthetic dataset; this time, increase the variance (radius) of the two Gaussians such that the generated data points from the different classes now overlap. What happens to the behavior of the algorithm? Does it converge? Show the classification regions obtained at the end of T epochs.

3. (10 points) In Lectures 2 and 5, we derived closed-form expressions for solving linear and ridge regression problems. This is great for finding linear behavior in data; however, if the data is nonlinear, just as in the classification case, we have to resort to the kernel trick, i.e., replace all dot products in the data space with kernel inner products. In this problem, we theoretically derive kernel regression. Suppose we are given training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ where each response $y_i$ is a scalar, and each data point $x_i$ is a vector in d dimensions.
a. Assuming that all data have been mapped into a higher-dimensional space using the feature mapping $x \mapsto \phi(x)$, write down an expression for the squared-error function using a linear predictor w in the high-dimensional space.
b. Let Φ be the matrix with n rows, where row i consists of the feature mapping $\phi(x_i)$. Write down a closed-form expression for the optimal linear predictor w as a function of Φ and y.
c. For a new query data point z, the predicted value is given by $f(z) = \langle w, \phi(z) \rangle$. Plug in the closed-form expression for w from the previous sub-problem to get an expression for f(z).
d. Suppose you are given black-box access to a kernel dot-product function K where $K(x, x') = \langle \phi(x), \phi(x') \rangle$. Mathematically show that all the calculations in (b) and (c) can be performed by invoking the kernel dot product alone, without ever explicitly writing down $\phi(x)$. You may want to use the Sherman-Morrison-Woodbury identity for matrices:
$$(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}.$$

4.
(20 points) The Fashion MNIST dataset is a database of (low-resolution) clothing images that is similar to MNIST but somewhat more challenging. You can load it in Colab using the Python code below.
a. Load the dataset and display 10 representative images from each class.
b. Implement the following classification methods: k-NN, logistic regression, and support vector machines (with linear and rbf kernels). You can use sklearn.
c. Report the best possible test-error performance of each of your methods by tuning its hyperparameters.
d. Report the train- and test-running times of each of your methods in the form of a table, and comment on the relative tradeoffs across the different methods.

import tensorflow as tf
from tensorflow import keras
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

1. (10 points) Consider a one-hidden-layer neural network (without biases, for simplicity) with sigmoid activations, trained with squared-error loss:

ŷ = W2 σ(W1 x),    L = ‖ŷ − y‖_2^2

Draw the computational graph and derive the forward and backward passes used in backpropagation for this network. Qualitatively compare the computational complexity of the forward and backward passes. Which pass is more expensive, and by roughly how much?

2. (10 points) Suppose that a convolutional layer of a neural network has an input tensor X[i, j, k] and computes an output as follows:

Z[i, j, m] = Σ_{k1} Σ_{k2} Σ_n W[k1, k2, n, m] X[i + k1, j + k2, n] + b[m]
Y[i, j, m] = ReLU(Z[i, j, m])

for some kernel W and bias b. Suppose X and W have shapes (48, 64, 10) and (3, 3, 10, 20) respectively.
a. What are the shapes of Z and Y?
b. What are the numbers of input and output channels?
c. How many multiply and add operations are required to perform a forward pass through this layer? Rough calculations are OK.
d. What is the total number of trainable parameters in this layer?

3.
(10 points) The use of ℓ2 regularization for training multi-layer neural networks has a special name: weight decay. Assume an arbitrary dataset {(xi, yi)}, i = 1, . . . , n, and a loss function L(w), where w represents the trainable weights (and biases).
a. Write down the ℓ2-regularized loss, using a weighting parameter λ for the regularizer.
b. Derive the gradient descent update rules for this loss.
c. Conclude that in each update, the weights are "shrunk" or "decayed" by a multiplicative factor before applying the descent update.
d. What does increasing λ achieve algorithmically, and how should the learning rate be chosen to make the updates stable?

4. (20 points) In this exercise, we will fine-tune a pre-trained deep network (ResNet34) on a particular two-class dataset, which can be downloaded from the attached zip file. Code for pre-processing the dataset, and for loading ResNet34, can be found below. Since ResNet34 is built for 1000 output classes, you will need to modify the last layer to output two classes. Train for 20 epochs and plot train and validation loss curves. Report your final train and validation accuracies.

import torchvision
from torchvision import datasets, models, transforms

transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
train_set = datasets.ImageFolder("data/train", transforms)
val_set = datasets.ImageFolder("data/val", transforms)
model = models.resnet34(pretrained=True)

1. (10 points) Assume that you have 4 samples, each with dimension 3, described in the data matrix X:

X = [ 3 2 1
      2 4 5
      1 2 3
      0 2 5 ]

For the problems below, you may do the calculations in Python (or R or Matlab). Explain your calculations in each step.
a. Find the sample mean.
b. Zero-center the samples, and find the eigenvalues and eigenvectors of the data covariance matrix Q.
c. Find the PCA coefficients corresponding to each of the samples in X.
d.
Reconstruct the original samples from the top two principal components, and report the reconstruction error for each of the samples.

2. (10 points) In class, we analyzed the per-iteration complexity of k-means. Here, we will prove that the k-means algorithm terminates in a finite number of iterations. Consider a data set X = {x1, . . . , xn} ⊂ R^d.
a. Show that the k-means loss function can be re-written in the form

F(η, μ) = Σ_{i=1}^{n} Σ_{j=1}^{k} η_{ij} ‖x_i − μ_j‖^2

where η = (η_{ij}) is a suitable binary matrix (with 0/1 entries). Provide a precise interpretation of η.
b. Show that each iteration of Lloyd's algorithm can only decrease the value of F.
c. Conclude that the algorithm will terminate in no more than T iterations, where T is some finite number. Give an upper bound on T in terms of the number of points n.

3. (10 points) Using the Senate Votes dataset demo'ed in Lecture 11, perform k-means clustering with k = 2 and show that you can learn (most of) the Senators' parties in a completely unsupervised manner. Which Senators did your algorithm make a mistake on, and why?

4. (20 points) The Places Rated Almanac, written by Boyer and Savageau, rates the livability of several US cities according to nine factors: climate, housing, healthcare, crime, transportation, education, arts, recreation, and economic welfare. The ratings are available in tabular form, available as a supplemental text file. Except for housing and crime, higher ratings indicate better quality of life. Let us use PCA to interpret this data better.
a. Load the data and construct a table with 9 columns containing the numerical ratings. (Ignore the last 5 columns; they consist of auxiliary information such as longitude/latitude, state, etc.)
b. Replace each value in the matrix by its base-10 logarithm. (This pre-processing is done for convenience, since the numerical range of the ratings is large.) You should now have a data matrix X whose rows are 9-dimensional vectors representing the different cities.
c. Perform PCA on the data. Remember to center the data points first by computing the mean data vector μ and subtracting it from every point. With the centered data matrix, do an SVD and compute the principal components.
d. Write down the first two principal components v1 and v2. Provide a qualitative interpretation of the components. Which among the nine factors do they appear to correlate the most with?
e. Project the data points onto the first two principal components. (That is, compute the top two scores of each of the data points.) Plot the scores as a 2D scatter plot. Which cities correspond to outliers in this scatter plot?
f. Repeat Steps 2–5, but with a slightly different data matrix: instead of computing the base-10 logarithm, use the z-scores. (The z-score is calculated by computing the mean μ and standard deviation σ for each feature, and normalizing each entry x as (x − μ)/σ.) How do your answers change?
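The center-SVD-project pipeline described in parts (c)–(e) can be sketched in a few lines of NumPy. This is only an illustrative outline on a tiny stand-in matrix; the real data matrix of (log-transformed) city ratings must be loaded as described above:

```python
import numpy as np

# Toy stand-in for the ratings matrix: 5 "cities" x 3 "features".
X = np.array([[2.1, 3.0, 1.2],
              [2.0, 3.1, 1.0],
              [2.3, 2.9, 1.4],
              [1.9, 3.2, 0.9],
              [2.2, 3.0, 1.3]])

mu = X.mean(axis=0)                    # mean data vector
Xc = X - mu                            # center every point
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
v1, v2 = Vt[0], Vt[1]                  # first two principal components
scores = Xc @ Vt[:2].T                 # top-2 scores for the 2D scatter plot
```

The rows of Vt are the principal directions sorted by singular value, so the projection onto the top two components is a single matrix product.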
Purpose
The goals of this assignment are the following:
• Learn about process creation and control in a Linux environment
• Get experience with the fork(), wait() and execl() system functions
• Gain more experience with the C programming language from an OS perspective

Assignment-1: Parent and Child Processes (100 points)
Write a program in C that will perform the following tasks (you must follow the task sequence/order below):
1. Your main program (i.e., parent process) will fork (create) a child process (e.g., child_1).
2. The parent process will wait for child_1 to complete before forking (creating) child_2.
3. child_1 will fork its own child_1.1 and wait for its completion.
4. child_1.1 will call an external program "external_program.out" (the source code of this file, external_program.c, will be provided to you) and must pass its PID concatenated with the string "for child_1.1" to the external program (hint: execl()). As a result of this system call, child_1.1 will be replaced by external_program.out. The path to "external_program.out" should be passed into the main program as a command-line argument (hint: argc, argv).
5. After completion of the child_1.1 process, child_1 should complete, as no more jobs remain for child_1.
6. The parent will now fork child_2, and wait for the completion of child_2.
7. child_2 will make a call to the same external program "external_program.out". child_2 must pass its PID concatenated with the string "for child_2" to the external program. As a result of this system call, child_2 will be replaced by external_program.out (hint: execl()). The path to "external_program.out" should be passed into the main program as a command-line argument (hint: argc, argv).
8. The parent process will now terminate.
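The assignment itself must be written in C, but the fork-then-wait control flow of steps 1, 2 and 6 can be prototyped with Python's os module, which exposes thin wrappers over the same fork()/wait()/execl() syscalls. Using sys.executable as a stand-in external program is an assumption for illustration only; the real program is external_program.out, whose path arrives via argv:

```python
import os
import sys

def spawn_and_wait(label):
    """Fork a child that exec's a stand-in external program, then wait for it."""
    pid = os.fork()
    if pid == 0:
        # In the child: exec replaces the process image, like execl() in C.
        # sys.executable stands in for external_program.out here.
        os.execl(sys.executable, sys.executable, "-c",
                 f"import os; print('The PID was', os.getpid(), 'for {label}')")
    os.waitpid(pid, 0)          # parent blocks until this child completes
    return pid

print(f"parent (PID {os.getpid()}): process started")
c1 = spawn_and_wait("child_1")  # parent waits for child_1 ...
c2 = spawn_and_wait("child_2")  # ... before forking child_2
print(f"parent (PID {os.getpid()}): completed")
```

The key ordering guarantee — child_2 is not forked until child_1 has been reaped — comes from calling waitpid() before the second fork, exactly as in the C version.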
The expected output from your program should look like the following:

parent (PID 1655): process started
parent (PID 1655): forking child_1
parent (PID 1655): fork successful for child_1 (PID 1656)
parent (PID 1655): waiting for child_1 (PID 1656) to complete
child_1 (PID 1656): forking child_1.1
child_1 (PID 1656): fork success for child_1.1 (PID 1657)
child_1 (PID 1656): waiting for child_1.1 (PID 1657) to complete
child_1.1 (PID 1657): calling an external program [external_program.out]
From the external program: The PID was 1657 for child_1.1
child_1 (PID 1656): completed
parent (PID 1655): forking child_2
parent (PID 1655): fork successful for child_2 (PID 1658)
parent (PID 1655): waiting for child_2 (PID 1658) to complete
child_2 (PID 1658): calling an external program [external_program.out]
From the external program: The PID was 1658 for child_2
parent (PID 1655): completed

Hints: fork(), wait(), getpid(), getppid(), execl(), strcat()

Mark Distribution
This section describes a tentative allocation of marks assigned for the desired features. (100 points)
a) A parent process will create two child processes: 20 points
b) Parent will wait for child_1 to complete before creating child_2: 15 points
c) child_1 will create its own child, child_1.1: 15 points
d) child_1.1 will make a system call to an external program: 15 points
e) child_2 will make a system call to an external program: 15 points
f) The parent process must not terminate until all child processes are completed: 20 points

Computing Platform for Assignments
You are responsible for ensuring that your program compiles and runs without error on the computing platform mentioned below. Marks will be deducted if your program fails to compile, or your program runs into errors on the specified computing platform (see below).
• Students have virtual access to the MC 244 lab, which contains 30 Fedora 28 systems. The Linux machines available to you are linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca.
• It is your responsibility to ensure that your code compiles and runs on the above systems. You can SSH into the MC 244 machines.
• If you are off campus, you have to SSH to compute.gaul.csd.uwo.ca first (this server is also known as sylvia.gaul.csd.uwo.ca, in honour of Dr. Sylvia Osborn), and then to one of the MC 244 systems (linux01.gaul.csd.uwo.ca to linux30.gaul.csd.uwo.ca).
• https://wiki.sci.uwo.ca/sts/computer-science/gaul

Provided Files
• The source code for the external program "external_program.out" is provided to you as "external_program.c". DO NOT make any changes to "external_program.c".
• When running the program, you must provide the path to "external_program.out" as an argument (see the Assignment 1 tutorial PowerPoint).
• If you have any questions, please contact the designated TAs or the Instructor.

Assignment Submission
You need to submit only one C file. The name of your submitted C file must be "assignment1.c". Marks will be deducted if your submitted C file name is different. You must submit your assignment through OWL. Be sure to test your code on one of the MC 244 systems (see the "Computing Platform for Assignments" section above). Marks will be deducted if your program fails to compile or your program runs into errors on the computing platform mentioned above. An Assignment 1 FAQ will be made available on OWL. Also, consult the TAs and the Instructor for any questions you may have regarding this assignment.

Purpose
The goals of this assignment are the following:
• Get experience with the fork(), wait() and pipe() system functions.
• Learn how to use a pipe for bi-directional communication between parent and child processes.
• Gain more experience with the C programming language from an OS perspective.

Inter-Process Communications (100 points)
Write a C program that will accept three strings from the user as command-line arguments (for example, X, Y, and Z). Your parent process will create a child.
While the parent waits for a message to become available in the pipe, the child will access X ("CS") and write it into the pipe. The parent will then read the message from the pipe and access Y ("3305"). Now the parent will concatenate the read message and Y ("CS 3305"), and write it into the pipe. This time the child will read the message from the pipe and access Z ("is Fun!"). The child will then concatenate the read message and Z ("CS 3305 is Fun!"), and write it into the pipe. Now the child will be completed, and the parent will read the message from the pipe.

The expected output from your program should look like the following for the arguments "CS", "3305", and "is Fun!":

parent (PID 1458): created child (PID 1459)
child (PID 1459): received X = "CS"
child (PID 1459): writing "CS" into pipe
parent (PID 1458): read from pipe "CS"
parent (PID 1458): received Y = "3305"
parent (PID 1458): "CS" + Y = "CS 3305"
parent (PID 1458): writing into pipe "CS 3305"
child (PID 1459): read from pipe "CS 3305"
child (PID 1459): received Z = "is Fun!"
child (PID 1459): "CS 3305" + Z = "CS 3305 is Fun!"
child (PID 1459): writing into pipe "CS 3305 is Fun!"
child (PID 1459): all tasks completed
parent (PID 1458): read from pipe "CS 3305 is Fun!"
parent (PID 1458): all tasks completed

Hints: fork(), wait(), pipe(), write(), read()

Mark Distribution
This section describes a tentative allocation of marks assigned for the desired features.
• Inter-Process Communications (100 points)
a) Child reads X & Z from the command line: 10 points
b) Parent reads Y from the command line: 12 points
c) A pipe is created for communication between parent and child: 20 points
d) Child writes X into the pipe: 12 points
e) Parent reads X from the pipe: 12 points
f) Parent concatenates the message and Y before writing into the pipe: 12 points
g) Child concatenates the message and Z before writing into the pipe: 12 points
h) Output the correct string: 10 points

NOTE: Marks will be deducted if error handling is not implemented in your code.

Computing Platform for Assignments
You are responsible for ensuring that your program compiles and runs without error on the computing platform mentioned below. Marks will be deducted if your program fails to compile, or your program runs into errors on the specified computing platform (see below).
• Students have virtual access to the MC 244 lab, which contains 30 Fedora 28 systems. The Linux machines available to you are linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca.
• It is your responsibility to ensure that your code compiles and runs on the above systems. You can SSH into the MC 244 machines (please see the Assignment 1 file transfer tutorial).
• If you are off campus, you have to SSH to compute.gaul.csd.uwo.ca first (this server is also known as sylvia.gaul.csd.uwo.ca, in honour of Dr. Sylvia Osborn), and then to one of the MC 244 systems (linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca) (please see the Assignment 1 file transfer tutorial).
• https://wiki.sci.uwo.ca/sts/computer-science/gaul

Assignment Submission
You need to submit only one C file. The name of your submitted C file must be "assignment2.c". Marks will be deducted if your submitted C file name is different. You must submit your assignment through OWL. Be sure to test your code on one of the MC 244 systems (see the "Computing Platform for Assignments" section above).
Marks will be deducted if your program fails to compile, or your program runs into errors on the computing platform mentioned above. An Assignment 2 FAQ will be made available on OWL, as needed. Also, consult the TAs and the Instructor for any questions you may have regarding this assignment.

Purpose
The goals of this assignment are the following:
• Get experience with the pipe and pthread system functions.
• Learn how to create multiple threads for different tasks.
• Learn how to share data between threads using a pipe.
• Gain more experience with the C programming language from an OS perspective.

Inter-Thread Communications (100 points)
Write a C program that will accept two integers from the user as command-line arguments (for example, X and Y, where X and Y are positive integers and X > Y). The parent process will read X and Y from the command line. The parent process will create three threads of type pthread_t (i.e., thread 1, thread 2, and thread 3). The parent process will write X and Y to the shared memory using a pipe. The first thread (i.e., thread 1) will read X and Y from the pipe and perform the subtraction S = X − Y; the result S will then be written to the pipe. Next, the second thread (i.e., thread 2) will read S from the pipe and determine whether S is a prime number, and then S will be written again to the pipe by the second thread. Finally, the third thread (i.e., thread 3) will read S from the pipe and reverse the number S.

The expected output from your program should look like the following (in this example, X and Y represent 33 and 4, respectively):

1. parent (PID 1729) receives X = 33 and Y = 4 from the user
2. parent (PID 1729) writes X = 33 and Y = 4 to the pipe
3. thread (TID 1) reads X = 33 and Y = 4 from the pipe
4. thread (TID 1) writes X-Y = 29 to the pipe
5. thread (TID 2) reads X-Y = 29 from the pipe
6. thread (TID 2) identified that 29 is a prime number
7. thread (TID 2) writes 29 to the pipe
8. thread (TID 3) reads X-Y = 29 from the pipe
9.
thread (TID 3) reversed number is 92

In the above example, if S is NOT a prime number, then the phrase "identified that xx is a prime number" in line 6 above must be replaced with the phrase "identified that xx is NOT a prime number". You must control the execution of the threads to follow the sequence in the expected output above. You must not use more than one pipe for this assignment. If you pass multiple parameters through the single pipe, concatenate the parameters using any delimiter so that you can parse them accordingly later. This assignment will be tested only with positive integers where X > Y.

Your implementation must have the following functions:
1. void *subtract(void *thread_id): This function is executed by thread 1. It reads X and Y from the pipe, performs the subtraction S = X − Y, and writes S to the pipe.
2. void *prime_check(void *thread_id): This function is executed by thread 2. It reads S from the pipe and determines whether S is a prime number.
3. void *reverse_num(void *thread_id): This function is executed by thread 3. It reverses the number S.

Mark Distribution
This section describes a tentative allocation of marks assigned for the desired features.
• Inter-Thread Communications (100 points)
a) Parent reads X and Y from the user: 10 points
b) The first thread reads X and Y from the pipe: 15 points
c) The first thread writes its result to the pipe: 10 points
d) The second thread reads the result from the pipe: 10 points
e) The second thread identifies whether it is a prime number: 15 points
f) The third thread reads the result from the pipe: 10 points
g) The third thread reverses the number: 15 points
h) Control of the thread execution flow: 15 points

You must pass the input to the program using command-line arguments. Hard-coded input will not be accepted, and marks will be deducted accordingly. Also, marks will be deducted if error handling is not implemented in your code.
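Before committing to the C/pthreads version, the three-stage flow through a single shared pipe can be prototyped in Python with threading and os.pipe. The results dict below is only there to make the sketch checkable; it is not part of the assignment's required output, and the sequential start/join loop plays the role of the thread-ordering control the assignment asks for:

```python
import os
import threading

r, w = os.pipe()       # the single pipe shared by all three threads
results = {}

def subtract():
    x, y = map(int, os.read(r, 64).decode().split(","))  # parse delimited X,Y
    results["S"] = x - y
    os.write(w, str(results["S"]).encode())

def prime_check():
    s = int(os.read(r, 64).decode())
    results["prime"] = s > 1 and all(s % d for d in range(2, int(s**0.5) + 1))
    os.write(w, str(s).encode())     # write S back for the third thread

def reverse_num():
    s = os.read(r, 64).decode()
    results["reversed"] = int(s[::-1])

os.write(w, b"33,4")   # parent writes X and Y, comma-delimited
for fn in (subtract, prime_check, reverse_num):
    t = threading.Thread(target=fn)
    t.start()
    t.join()           # joining each thread before starting the next
                       # enforces the required execution sequence
```

Because each write completes before the next thread starts, every read sees exactly the message intended for it, mirroring the ordered pipeline in the expected output.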
Computing Platform for Assignments
You are responsible for ensuring that your program compiles and runs without error on the computing platform mentioned below. Marks will be deducted if your program fails to compile or runs into errors on the specified computing platform (see below).
• Students have virtual access to the MC 244 lab, which contains 30 Fedora 28 systems. The Linux machines available to you are linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca.
• It is your responsibility to ensure that your code compiles and runs on the above systems. You can SSH into the MC 244 machines (please see the Assignment 1 file transfer tutorial).
• If you are off-campus, you have to SSH to compute.gaul.csd.uwo.ca first (this server is also known as sylvia.gaul.csd.uwo.ca, in honor of Dr. Sylvia Osborn), and then to one of the MC 244 systems (linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca) (please see the Assignment 1 file transfer tutorial).
• https://wiki.sci.uwo.ca/sts/computer-science/gaul

Assignment Submission
You need to submit only one C file. The name of your submitted C file must be "assignment3.c". Marks will be deducted if your submitted C file name is different. You must submit your assignment through OWL. Be sure to test your code on one of the MC 244 systems (see the "Computing Platform for Assignments" section above). Marks will be deducted if your program fails to compile or runs into errors on the computing platform mentioned above. An Assignment 3 FAQ will be made available on OWL as needed. Also, consult the TAs and the Instructor for any questions you may have regarding this assignment.

Purpose
The main goal of this assignment is to gain hands-on experience with CPU scheduling algorithms using the C language in a Linux environment.

Performance Evaluation of CPU Scheduling Algorithms
You will develop CPU scheduling algorithms using the C programming language in Linux.
A sample input file is provided with the assignment, which must be used to develop the Round Robin (RR) CPU scheduling algorithm.

Format of the Input File:
You must use the same input file name (i.e., rr_input.txt) in your code. The input file contains several test cases for this assignment. Every line of the input file represents one individual test case. For example, if there are four lines inside the input file, your program must execute four different test cases. Every input line consists of several processes and their corresponding information, such as process name, arrival time (art_1), burst time (brt_1), and quantum time (q_time), in the following format:

Input format: process_1 art_1 brt_1 process_2 art_2 brt_2 ... process_n art_n brt_n q_time
Input example: p1 0 23 p2 1 3 p3 2 3 4

In the example input given above, there are three processes: p1, p2, and p3. The arrival and burst times of process p1 are 0 and 23, of process p2 are 1 and 3, and of process p3 are 2 and 3. The quantum time is 4. The individual entries in the input line are separated by spaces.

What you need to do:
You must supply the given rr_input.txt file to your program. First, parse the input file line by line; every line in the input file is a test case. For every test case in the input file, apply Round Robin scheduling and output the process schedule details. The output of your program for the sample test cases in the given input file must follow the format of the sample output given below. You must print fractional numbers with two decimal places; marks will be deducted otherwise.
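Although the submitted solution must be in C, the Round Robin bookkeeping (ready queue, quantum, arrivals, turnaround and waiting times) can be sanity-checked against the sample output with a short Python prototype. The tie-breaking rule assumed here, matching the sample trace, is that newly arrived processes enter the ready queue ahead of a process whose quantum just expired:

```python
from collections import deque

def round_robin(procs, quantum):
    """procs: list of (name, arrival, burst); returns {name: (turnaround, waiting)}."""
    procs = sorted(procs, key=lambda p: p[1])          # admit in arrival order
    remaining = {name: burst for name, _, burst in procs}
    info = {name: (arrival, burst) for name, arrival, burst in procs}
    ready, t, i, done = deque(), 0, 0, {}
    while len(done) < len(procs):
        while i < len(procs) and procs[i][1] <= t:     # admit new arrivals
            ready.append(procs[i][0]); i += 1
        if not ready:                                  # CPU idle this tick
            t += 1
            continue
        name = ready.popleft()
        for _ in range(quantum):                       # run for one quantum
            t += 1
            remaining[name] -= 1
            while i < len(procs) and procs[i][1] <= t: # arrivals queue ahead
                ready.append(procs[i][0]); i += 1      # of the preempted process
            if remaining[name] == 0:                   # completed mid-quantum
                arrival, burst = info[name]
                done[name] = (t - arrival, t - arrival - burst)
                break
        else:
            ready.append(name)                         # quantum used up; requeue
    return done
```

Running this on the two sample test cases reproduces the per-process Turn Around and Waiting Times shown in the sample output below, which is a useful cross-check while debugging the C version.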
Sample Content in rr_input.txt:

p1 0 24 p2 2 3 p3 6 3 4
p1 1 10 p2 2 5 p3 2 8 p4 6 9 2

Sample Output:

Test case #1: p1 0 24 p2 2 3 p3 6 3 4
Number of Processes: 3, Quantum: 4
Process Scheduling Started:
CPU Time 0: [p1 Arrived] p1 [1/24]
CPU Time 1: p1 [2/24]
CPU Time 2: [p2 Arrived] p1 [3/24]
CPU Time 3: p1 [4/24]
CPU Time 4: p2 [1/3]
CPU Time 5: p2 [2/3]
CPU Time 6: [p3 Arrived] p2 [3/3]
Process p2 completed with Turn Around Time: 5, Waiting Time: 2
CPU Time 7: p1 [5/24]
CPU Time 8: p1 [6/24]
CPU Time 9: p1 [7/24]
CPU Time 10: p1 [8/24]
CPU Time 11: p3 [1/3]
CPU Time 12: p3 [2/3]
CPU Time 13: p3 [3/3]
Process p3 completed with Turn Around Time: 8, Waiting Time: 5
CPU Time 14: p1 [9/24]
CPU Time 15: p1 [10/24]
CPU Time 16: p1 [11/24]
CPU Time 17: p1 [12/24]
CPU Time 18: p1 [13/24]
CPU Time 19: p1 [14/24]
CPU Time 20: p1 [15/24]
CPU Time 21: p1 [16/24]
CPU Time 22: p1 [17/24]
CPU Time 23: p1 [18/24]
CPU Time 24: p1 [19/24]
CPU Time 25: p1 [20/24]
CPU Time 26: p1 [21/24]
CPU Time 27: p1 [22/24]
CPU Time 28: p1 [23/24]
CPU Time 29: p1 [24/24]
Process p1 completed with Turn Around Time: 30, Waiting Time: 6
Process scheduling completed with Avg Turn Around Time: 14.33, Avg Waiting Time: 4.33

Test case #2: p1 1 10 p2 2 5 p3 2 8 p4 6 9 2
Number of Processes: 4, Quantum: 2
Process Scheduling Started:
CPU Time 0: None
CPU Time 1: [p1 Arrived] p1 [1/10]
CPU Time 2: [p2 Arrived] [p3 Arrived] p1 [2/10]
CPU Time 3: p2 [1/5]
CPU Time 4: p2 [2/5]
CPU Time 5: p3 [1/8]
CPU Time 6: [p4 Arrived] p3 [2/8]
CPU Time 7: p1 [3/10]
CPU Time 8: p1 [4/10]
CPU Time 9: p2 [3/5]
CPU Time 10: p2 [4/5]
CPU Time 11: p4 [1/9]
CPU Time 12: p4 [2/9]
CPU Time 13: p3 [3/8]
CPU Time 14: p3 [4/8]
CPU Time 15: p1 [5/10]
CPU Time 16: p1 [6/10]
CPU Time 17: p2 [5/5]
Process p2 completed with Turn Around Time: 16, Waiting Time: 11
CPU Time 18: p4 [3/9]
CPU Time 19: p4 [4/9]
CPU Time 20: p3 [5/8]
CPU Time 21: p3 [6/8]
CPU Time 22: p1 [7/10]
CPU Time 23: p1 [8/10]
CPU Time 24: p4 [5/9]
CPU Time 25: p4
[6/9]
CPU Time 26: p3 [7/8]
CPU Time 27: p3 [8/8]
Process p3 completed with Turn Around Time: 26, Waiting Time: 18
CPU Time 28: p1 [9/10]
CPU Time 29: p1 [10/10]
Process p1 completed with Turn Around Time: 29, Waiting Time: 19
CPU Time 30: p4 [7/9]
CPU Time 31: p4 [8/9]
CPU Time 32: p4 [9/9]
Process p4 completed with Turn Around Time: 27, Waiting Time: 18
Process scheduling completed with Avg Turn Around Time: 24.50, Avg Waiting Time: 16.50

Mark Distribution
This section describes a tentative allocation of points assigned for the desired features.
A. Supplying rr_input.txt to the program: 5 points
B. Parsing the input file correctly: 10 points
C. Printing the number of processes and the quantum for every test case: 5 points
D. Detecting process arrivals correctly: 15 points
E. Detecting process completions correctly: 15 points
F. Calculating the Turn Around Time and Waiting Time for every process: 15 points
G. Maintaining the Round Robin sequence and quantum: 25 points
H. Calculating the Avg Turn Around Time and Avg Waiting Time for every test case: 10 points

Computing Platform for Assignments
You are responsible for ensuring that your program compiles and runs without error on the computing platform mentioned below. Marks will be deducted if your program fails to compile or runs into errors on the specified computing platform (see below).
• Students have virtual access to the MC 244 lab, which contains 30 Fedora 28 systems. The Linux machines available to you are linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca.
• It is your responsibility to ensure that your code compiles and runs on the above systems. You can SSH into the MC 244 machines (please see the Assignment 1 file transfer tutorial).
• If you are off-campus, you have to SSH to compute.gaul.csd.uwo.ca first (this server is also known as sylvia.gaul.csd.uwo.ca, in honor of Dr.
Sylvia Osborn), and then to one of the MC 244 systems (linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca) (please see the Assignment 1 file transfer tutorial).
• https://wiki.sci.uwo.ca/sts/computer-science/gaul

Assignment Submission
You need to submit only one C file. The name of your submitted C file must be "assignment4.c". Marks will be deducted if your submitted C file name is different. You must submit your assignment through OWL. Be sure to test your code on one of the MC 244 systems (see the "Computing Platform for Assignments" section above). Marks will be deducted if your program fails to compile or runs into errors on the computing platform mentioned above. An Assignment 4 FAQ will be made available on OWL as needed. Also, consult the TAs and the Instructor for any questions you may have regarding this assignment.

Purpose
The goals of this assignment are the following:
• Get hands-on experience in developing mutual exclusion/semaphore/critical section techniques and algorithms.
• Gain experience with the C programming language from an OS process-synchronization perspective.

Assignment Description
Using the C programming language, you will develop a mutual exclusion algorithm for a process synchronization problem. Your mutual exclusion algorithm must guarantee that only one process accesses the critical section of your code at a given time. You are allowed to use any mutual exclusion/semaphore-related C function calls.

Description of the problem:
Assume that there is a set of n bank accounts (n ≥ 1) shared by a set of x clients (x ≥ 1). Clients can perform two different types of transactions with each bank account: deposit and withdraw funds. If a particular transaction would result in a negative account balance, the transaction should be ignored (i.e., an account balance should never be less than 0).
Example and structure of the input file:
In the following example, there are two bank accounts (account1 and account2) shared by a total of ten clients (c1 to c10). The clients are allowed to deposit money into both accounts and withdraw money from both accounts. The initial balances of the accounts are specified in the input file. An input file is shown below for illustrative purposes:

account1 balance 7000
account2 balance 4500
c1 deposit account1 100 withdraw account2 500
c2 withdraw account1 2500
…
c9 withdraw account1 1000 withdraw account2 500
c10 deposit account1 50 deposit account2 200

Illustration:
(i) account1 balance 7000
The above line specifies the initial balance of account #1 as $7000.
(ii) c1 deposit account1 100 withdraw account2 500
The above line specifies the operations performed by client #1: client #1 deposits $100 into account #1, then withdraws $500 from account #2.

The input file will be provided as "assignment_5_input.txt", and this file name must be hard-coded inside your program. A different input file will be used to evaluate your program for marking purposes; the structure and name of the input file will remain the same, and only the data will be different.

For every client, you must use a separate thread; the transactions of each client should be handled by a distinct thread. For appropriate synchronization among the threads (i.e., so that critical sections are protected against concurrent access by the threads), you must use mutual exclusion-related system functions (such as pthread_mutex_lock(), pthread_mutex_unlock(), etc., which will be discussed in lecture #14).

You must output the balance of each bank account after all the transactions have been performed. For each bank account, your output should display the account followed by the account balance. For example:

account1 balance 500
account2 balance 300

Your C program should output results to the screen.
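The required C solution uses pthread_mutex_lock()/pthread_mutex_unlock(); the same one-thread-per-client, lock-around-the-critical-section structure can be sketched in Python as below. The two hard-coded clients and their transactions are toy stand-ins; the real ones must be parsed from assignment_5_input.txt:

```python
import threading

balances = {"account1": 7000, "account2": 4500}
lock = threading.Lock()            # plays the role of a pthread_mutex_t

def client(transactions):
    for op, account, amount in transactions:
        with lock:                 # critical section: one thread at a time
            if op == "deposit":
                balances[account] += amount
            elif balances[account] - amount >= 0:
                balances[account] -= amount
            # else: ignore the transaction (would overdraw the account)

threads = [threading.Thread(target=client, args=(txns,)) for txns in [
    [("deposit", "account1", 100), ("withdraw", "account2", 500)],  # c1
    [("withdraw", "account1", 2500)],                               # c2
]]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # main thread waits for all clients
```

Holding the lock across the read-check-update of a balance is what makes the update atomic; without it, two clients could both pass the overdraft check and drive an account negative.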
Sample Input and Output Files
You must not use any other format, such as user input from the terminal or command-line arguments, for providing input to the program. Also, you must not modify the format of the provided input file. The output of your program must follow the format of the sample output given above. For testing purposes, we have designed the sample input in a way that the final account balances will always remain the same after completing the transactions. However, when your program is tested with another input file, note that due to the non-deterministic nature of threads, the final account balances may vary across multiple runs of your program.

Mark Distribution
This section describes a tentative allocation of points assigned for the desired features.
A. Supplying assignment_5_input.txt to the program: 10 points
B. Parsing the input file correctly: 10 points
C. Printing the correct number of accounts and clients: 10 points
D. Using a semaphore or mutual exclusion technique: 20 points
E. Using a separate thread for each client: 25 points
F. Correct balances of the accounts after all transactions: 25 points

Computing Platform for Assignments
You are responsible for ensuring that your program compiles and runs without error on the computing platform mentioned below. Marks will be deducted if your program fails to compile or runs into errors on the specified computing platform (see below).
• Students have virtual access to the MC 244 lab, which contains 30 Fedora 28 systems. The Linux machines available to you are linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca.
• It is your responsibility to ensure that your code compiles and runs on the above systems. You can SSH into the MC 244 machines (please see the Assignment 1 file transfer tutorial).
• If you are off-campus, you have to SSH to compute.gaul.csd.uwo.ca first (this server is also known as sylvia.gaul.csd.uwo.ca, in honor of Dr.
Sylvia Osborn), and then to one of the MC 244 systems (linux01.gaul.csd.uwo.ca through linux30.gaul.csd.uwo.ca) (please see the Assignment 1 file transfer tutorial).
• https://wiki.sci.uwo.ca/sts/computer-science/gaul

Assignment Submission

You need to submit only one C file. The name of your submitted C file must be "assignment5.c"; marks will be deducted if your submitted C file is named differently. You must submit your assignment through OWL. Be sure to test your code on one of the MC 244 systems (see the "Computing Platform for Assignments" section above). Marks will be deducted if your program fails to compile or runs into errors on the computing platform mentioned above. An Assignment 5 FAQ will be made available on OWL as needed. Also, consult the TAs and the instructor with any questions you may have regarding this assignment.
CIFAR-10 is a dataset of 32×32 images in 10 categories, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It is often used to evaluate machine learning algorithms. You can download this dataset from https://www.cs.toronto.edu/~kriz/cifar.html.

You can find a dataset dealing with European employment in 1979 at https://lib.stat.cmu.edu/DASL/Stories/EuropeanJobs.html. This dataset gives the percentage of people employed in each of a set of areas in 1979 for each of a set of European countries. Notice this dataset contains only 26 data points. That's fine; it's intended to give you some practice in visualization of clustering.

Problem 2: Do exercise 6.2 in the Jan 15 version of the course text.

Questions about the homework
Submission: Homework 5 submission details are as for earlier homeworks. You may use any programming language that amuses you for this homework, BUT it's much easier in R (just use lm).
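The hint above points at R's lm; the same ordinary-least-squares fit can be sketched with NumPy. The data below is made up purely for illustration (an exact line y = 2x + 1), so the fit recovers the coefficients exactly.

```python
import numpy as np

# Toy data lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with an intercept column, as lm(y ~ x) would build it.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solve; coef[0] is the intercept, coef[1] the slope.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [1.0, 2.0]
```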
In this assignment you will experiment with random variation over discrete events. It will be very helpful to use the analytical results and the experimental results to verify each other: if they do not align, you are probably doing something wrong (this is a very powerful and important check whenever working with real data). As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory: https://www.cs.utah.edu/~jeffp/teaching/latex/

Consider a domain of size n = 3000.
A: (5 points) Generate random numbers in the domain [n] until two have the same value. How many random trials did this take? We will use k to represent this value.
B: (10 points) Repeat the experiment m = 300 times, and record for each how many random trials this took. Plot this data as a cumulative density plot where the x-axis records the number of trials required k, and the y-axis records the fraction of experiments that succeeded (a collision) after k trials. The plot should show a curve that starts at a y value of 0, increases as k increases, and eventually reaches a y value of 1.
C: (5 points) Empirically estimate the expected number of random trials k needed to have a collision. That is, add up all values k, and divide by m.
D: (10 points) Describe how you implemented this experiment and how long it took for m = 300 trials. Show a plot of the run time as you gradually increase the parameters n and m. (For at least 3 fixed values of m between 300 and 10,000, plot the time as a function of n.) You should be able to reach values of n = 1,000,000 and m = 10,000.

Consider a domain [n] of size n = 100.
A: (5 points) Generate random numbers in the domain [n] until every value i ∈ [n] has had one random number equal to i. How many random trials did this take?
We will use k to represent this value.
B: (10 points) Repeat step A m = 300 times, and for each repetition record the value k of how many random trials were required to collect all values i ∈ [n]. Make a cumulative density plot as in 1.B.
C: (5 points) Use the above results to calculate the empirical expected value of k.
D: (10 points) Describe how you implemented this experiment and how long it took for n = 100 and m = 300 trials. Show a plot of the run time as you gradually increase the parameters n and m. (For at least 3 fixed values of m between 300 and 5,000, plot the time as a function of n.) You should be able to reach n = 20,000 and m = 5,000.

A: (12 points) Calculate analytically (using formulas from the notes) the number of random trials needed so there is a collision with probability at least 0.5 when the domain size is n = 3000. (Show your work.) How does this compare to your results from Q1.C?
B: (12 points) Calculate analytically (using formulas from the notes) the expected number of random trials before all elements are witnessed in a domain of size n = 100. (Show your work.) How does this compare to your results from Q2.C?

Consider the case when the only random function you have is one that chooses a bit at random. In particular, rand-bit() returns 0 or 1 uniformly at random.
A: (6 points) How can you use this to create a random integer between 1 and n = 1024?
B: (5 points) Describe a Las Vegas randomized algorithm ("Las Vegas" means: it may run forever, but you can bound how long you expect it to take, and when it finishes you know it is done) for when n = 1000.
C: (5 points) How many calls does this take in terms of n (say I were to increase the value n; how does the number of calls to rand-bit() change)?
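The two experiments above (trials until a collision, and trials until every value is seen) can be sketched as follows; the fixed seed is only there to make the illustration repeatable.

```python
import random

def trials_until_collision(n, rng):
    """Draw uniformly from [n] until a repeat occurs; return the number of draws k."""
    seen = set()
    k = 0
    while True:
        k += 1
        v = rng.randrange(n)
        if v in seen:
            return k
        seen.add(v)

def trials_until_all_seen(n, rng):
    """Draw uniformly from [n] until every value has appeared (coupon collector)."""
    seen = set()
    k = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        k += 1
    return k

rng = random.Random(0)  # fixed seed so the experiment is repeatable
ks = [trials_until_collision(3000, rng) for _ in range(300)]
print(sum(ks) / len(ks))                 # empirical estimate of E[k] (Q1.C)
print(trials_until_all_seen(100, rng))   # one coupon-collector run (Q2.A)
```

For n = 3000 the empirical mean should land near the analytic value of roughly sqrt(pi*n/2) ≈ 68.6 trials.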
Keep in mind that in many settings generating a random bit is much more expensive than a standard CPU operation (like an addition, an if statement, or a memory reference), so it is critical to minimize them.

Consider a domain size n and let k be the number of random trials run. Let f_i denote the number of trials that have value i. Note that for each i ∈ [n] we have E[f_i] = k/n. Let µ = max_{i ∈ [n]} f_i / k. Consider some parameter δ ∈ (0, 1). How large does k need to be for Pr[|µ − 1/n| ≥ 0.01] ≤ δ? That is, how large does k need to be for all counts to be within 1% of the average with probability δ? How does this change if we want Pr[|µ − 1/n| ≥ 0.001] ≤ δ (for 0.1% accuracy)? (Make sure to show your work.)

In this assignment you will explore the use of k-grams, Jaccard distance, min hashing, and LSH in the context of document similarity.
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D1.txt
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D2.txt
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D3.txt
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D4.txt
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory: https://www.cs.utah.edu/~jeffp/teaching/latex/

You will construct several types of k-grams for all documents. All documents have at most 27 characters: all lower case letters and the space.
[G1] Construct 2-grams based on characters, for all documents.
[G2] Construct 3-grams based on characters, for all documents.
[G3] Construct 3-grams based on words, for all documents.
Remember that you should only store each k-gram once; duplicates are ignored.
A: (20 points) How many distinct k-grams are there for each document with each type of k-gram? You should report 4 × 3 = 12 different numbers.
B: (20 points) Compute the Jaccard similarity between all pairs of documents for each type of k-gram.
You should report 3 × 6 = 18 different numbers.

We will consider a hash family H so that any hash function h ∈ H maps from h : {k-grams} → [m] for m large enough (I suggest m ≥ 10,000).
A: (25 points) Using grams [G2], build a min-hash signature for documents D1 and D2 using t = {10, 50, 100, 250, 500} hash functions. For each value of t, report the approximate Jaccard similarity between the pair of documents D1 and D2, estimating the Jaccard similarity as
  JŜ_t(a, b) = (1/t) · sum_{i=1}^{t} [1 if a_i = b_i, else 0].
You should report 5 numbers.
B: (5 points) What seems to be a good value for t? You may run more experiments. Justify your answer in terms of both accuracy and time.

Consider computing an LSH using t = 100 hash functions. We want to find all documents which have Jaccard similarity above τ = .4.
A: (8 points) Use the trick mentioned in class and the notes to estimate the best value of b hash functions within each of r bands to provide the S-curve
  f(s) = 1 − (1 − s^b)^r
with good separation at τ. Report these values. (The values of r and b were mixed up before. You can report either, but please be clear which means the number of bands (r) and which the number of hashes per band (b). This is now consistent with the notes, but the reverse of the book.)
B: (24 points) Using your choice of r and b and f(·), what is the probability of each pair of the four documents (using [G2]) being estimated to have similarity greater than τ? Report 6 numbers. (Show your work.)

Describe a scheme like Min-Hashing for the Andberg Similarity, defined
  Andb(A, B) = |A ∩ B| / (|A ∪ B| + |A △ B|).
That is, given two sets A and B and a family of hash functions, then Pr_{h ∈ H}[h(A) = h(B)] = Andb(A, B). Note the only randomness is in the choice of hash function h from the set H, and h ∈ H represents the process of choosing a hash function (randomly) from H. The point of this question is to design this process, and show that it has the required property.
Or show that such a process cannot be done.

In this assignment you will explore clustering: hierarchical and point-assignment. You will also experiment with high-dimensional data.
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A3/C1.txt
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A3/C2.txt
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A3/C3.txt
These data sets all have the following format. Each line is a data point. The lines have either 3 or 6 tab-separated items. The first is an integer giving the index of the point. The next 2 (or 5 for C3) are the coordinates of the data point. C1 and C2 are in 2 dimensions, and C3 is in 5 dimensions. C1 should have n = 20 points, C2 should have n = 1004 points, and C3 should have n = 1000 points. We will always measure distance with Euclidean distance.
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory: https://www.cs.utah.edu/~jeffp/teaching/latex/

There are many variants of hierarchical clustering; here we explore 3. The key difference is how you measure the distance d(S1, S2) between two clusters S1 and S2.
Single-Link: measures the shortest link, d(S1, S2) = min_{(s1,s2) ∈ S1×S2} ||s1 − s2||_2.
Complete-Link: measures the longest link, d(S1, S2) = max_{(s1,s2) ∈ S1×S2} ||s1 − s2||_2.
Mean-Link: measures the distance between the means. First compute a1 = (1/|S1|) sum_{s ∈ S1} s and a2 = (1/|S2|) sum_{s ∈ S2} s, then d(S1, S2) = ||a1 − a2||_2.
A (25 points): Run all hierarchical clustering variants on data set C1.txt until there are k = 4 clusters, and report the results as sets. Which variant did the best job, and which was the easiest to compute (consider what would happen if the data were much larger)? Explain your answers.

Point-assignment clustering works by assigning every point x ∈ X to the closest cluster center in C. Let φ_C : X → C be this assignment map, so that φ_C(x) = arg min_{c ∈ C} d(x, c).
All points that map to the same cluster center are in the same cluster. Two good heuristics for these types of clusters are the Gonzalez (Algorithm 9.4.1) and k-Means++ (Algorithm 10.1.2) algorithms.
A: (20 points) Run Gonzalez and k-Means++ on data set C2.txt for k = 3. To avoid too much variation in the results, choose c1 as the point with index 1. Report the centers and the subsets for Gonzalez. Report:
• the 3-center cost max_{x ∈ X} d(x, φ_C(x)), and
• the 3-means cost sum_{x ∈ X} (d(x, φ_C(x)))^2.
For k-Means++, the algorithm is randomized, so you will need to report the variation in this algorithm. Run several trials (at least 20) and plot the cumulative density function of the 3-means cost. Also report what fraction of the time the subsets are the same as the result from Gonzalez.
B: (20 points) Recall that Lloyd's algorithm for k-means clustering starts with a set of k centers C and runs as described in Algorithm 10.1.1.
• Run Lloyd's algorithm with C initialized to the points indexed {1, 2, 3}. Report the final subsets and the 3-means cost.
• Run Lloyd's algorithm with C initialized to the output of Gonzalez above. Report the final subsets and the 3-means cost.
• Run Lloyd's algorithm with C initialized to the output of each run of k-Means++ above. Plot a cumulative density function of the 3-means cost. Also report the fraction of the trials in which the subsets are the same as the input.
C: (10 points) Consider a set of points S ⊂ R^d and let d be the Euclidean distance. Prove that
  arg min_{p ∈ R^d} sum_{x ∈ S} (d(x, p))^2 = (1/|S|) sum_{x ∈ S} x.
Here are some suggested steps to follow towards the proof (note there are also other valid ways to prove this, but, for instance, achieving some of these steps will get you partial credit):
1. First prove the same result for S ⊂ R^1.
2. Expand each term (d(x, p))^2 = (x − p)^2 = x^2 + p^2 − 2xp.
3. Add the above terms together and take the first derivative.
4.
Show that the result for each dimension can be solved independently (use properties of edge lengths in a right triangle).

The k-median clustering problem on a data set P is to find a set of k centers C = {c1, c2, . . . , ck} to minimize Cost1(P, C) = sum_{p ∈ P} d(p, φ_C(p)). We did not explicitly talk much about this formulation in class, but the techniques to solve it are all typically extensions of approaches we did talk about. This problem will be more open-ended, and will ask you to try various approaches to solve it. We will use data set C3.txt.
A: (20 points) Find a set of 4 centers C = {c1, c2, c3, c4} for the 4-medians problem on data set C3.txt. Report the set of centers, as well as Cost1(P, C). The centers should be in the write-up you turn in, but also in a file formatted the same way as the input so we can verify the cost you found. That is, each line has 1 center with 6 tab-separated numbers: the first being the index (e.g., 1, 2, 3 or 4), and the next 5 being the 5-dimensional coordinates of that center. Your score will be based on how small a Cost1(P, C) you can find. You can get 15 points for a reasonable solution. The smallest score found in the class will get all 20 points. Other scores will obtain points in between. Very briefly describe how you found the centers.
B: (5 points) Run your algorithm again for the 5-medians problem on data set C3.txt. Report the set of 5 centers and the Cost1(P, C). You do not need to turn in a file for these; just write it in your report.

Recall that the k-center problem is to find a set of k centers C to minimize Cost0(P, C) = max_{p ∈ P} min_{c ∈ C} d(p, c). Let C* be the optimal choice of k centers for the k-center problem, and let V* = Cost0(P, C*).
Prove that the Gonzalez algorithm always finds a set of k centers C such that Cost0(P, C) ≤ 2V*.

In this assignment you will explore finding frequent items in data sets, with emphasis on techniques designed to work at enormous scale.
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A4/S1.txt
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A4/S2.txt
These data sets each describe a set of 2000 characters, separated by spaces. The order of the file represents the order of the stream.
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory: https://www.cs.utah.edu/~jeffp/teaching/latex/
A (20 points): Run the Misra-Gries algorithm (see L12.2.2) with (k − 1) = 9 counters on streams S1 and S2. Report the output of the counters at the end of the stream. In each stream, from just the counters, report how many objects might occur more than 20% of the time, and which must occur more than 20% of the time.
B (20 points): Build a Count-Min Sketch (see L12.2.3) with k = 10 counters using t = 5 hash functions. Run it on streams S1 and S2. For both streams, report the estimated counts for objects a, b, and c. Just from the output of the sketch, which of these objects, with probability 1 − δ = 31/32, might occur more than 20% of the time?
C (5 points): How would these algorithms need to change (to answer the same questions) if each object of the stream was a "word" seen on Twitter, and the stream contained all tweets?
D (5 points): Name one advantage of the Count-Min Sketch over the Misra-Gries algorithm.

The exact heavy-hitter problem is as follows: return all objects that occur more than 10% of the time. It cannot return any false positives or any false negatives.
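The Misra-Gries counters from part A above can be sketched as follows; a tiny toy stream with k = 3 (so only 2 counters, and the decrement step actually fires) is used here instead of S1 or S2.

```python
def misra_gries(stream, k):
    """Misra-Gries with k-1 counters: any item occurring more than m/k times
    in a stream of length m is guaranteed to survive in the counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement every counter, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Toy stream: 'a' occurs 4 of 7 times, well above the 1/3 threshold for k = 3.
counters = misra_gries("abacada", k=3)
print(counters)  # {'a': 3, 'd': 1}
```

Note each stored count is an undercount by at most m/k, which is what lets you separate "might occur" from "must occur" above a frequency threshold.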
In the streaming setting, this requires Ω(min{m, n}) space if there are n objects that can occur and the stream is of length m.
A: (1 point) A 2-pass streaming algorithm is one that is able to read all of the data in order exactly twice, but still only has limited memory. Describe a small-space algorithm to solve the exact heavy-hitter problem (say for a φ = 10% threshold).

In this assignment you will explore regression techniques on high-dimensional data.
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A5/A.dat
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A5/X.dat
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A5/Y.dat
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A5/M.dat
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A5/W.dat
and a file stub:
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A5/FD.m
These data sets are in matrix format and can be loaded into MATLAB or OCTAVE. Calling load filename (for instance load X.dat) will put the data in the file into memory, in this example as the matrix X. You can then display this matrix by typing X.
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory: https://www.cs.utah.edu/~jeffp/teaching/latex/

First we will compute the SVD of the matrix A we have loaded:
[U,S,V] = svd(A)
Then take the top k components of A for values of k = 1 through k = 10 using
Uk = U(:,1:k)
Sk = S(1:k,1:k)
Vk = V(:,1:k)
Ak = Uk*Sk*Vk'
A (10 points): Compute and report the L2 norm of the difference between A and Ak for each value of k using norm(A-Ak,2).
B (5 points): Find the smallest value k so that the L2 norm of A-Ak is less than 10% that of A; k might or might not be larger than 10.
C (5 points): Treat the matrix as 1125 points in 30 dimensions.
Plot the points in 2 dimensions in the way that minimizes the sum of residuals squared.

Use the stub file FD.m to create a function for the Frequent Directions algorithm (Algorithm 16.2.1). We will consider running this code on matrix A.
A (15 points): We can measure the error max_{||x||=1} | ||Ax||^2 − ||Bx||^2 | as norm(A'*A − B'*B, 2). How large does l need to be for the above error to be at most ||A||_F^2 / 10? How does this compare to the theoretical bound (e.g., for k = 0)? Note: you can calculate ||A||_F^2 as norm(A, 'fro')^2.
B (25 points): Frequent Directions should also satisfy another bound based on its Frobenius norm. We can compute A·Π_{Bk} using Bk = B(1:k,:) and then calculating A * pinv(Bk) * Bk. How large does l need to be to achieve ||A − A·Π_{Bk}||_F^2 ≤ 1.1 · ||A − Ak||_F^2, for each value k ∈ {1, 2, 3, 4, 5, 6, 7}? Answer both by running your algorithm and by reporting the theoretical bound provided in the notes. (That is, you should report 7 pairs of values: an empirical and a theoretical bound for each value k.)

We will find coefficients C (these were a1, . . . , ad in the notes, but are renamed to avoid confusion with the matrix A in Q1) to estimate X*C ≈ Y, using the provided datasets X and Y. We will compare two approaches: least squares and ridge regression.
Least Squares: Set C = inverse(X'*X)*X'*Y
Ridge Regression: Set Cs = inverse(X'*X + s^2*eye(12))*X'*Y
A (20 points): Solve for the coefficients C (or Cs) using Least Squares and Ridge Regression with s = {0.1, 0.3, 0.5, 1.0, 2.0} (i.e., s will take on one of those 5 values each time you try, say obtaining C2 for s = 2). For each set of coefficients, report the error in the estimate Ŷ of Y as norm(Y − X*C, 2).
B (20 points): Create three row-subsets of X and Y:
• X1 = X(1:66,:) and Y1 = Y(1:66)
• X2 = X(34:100,:) and Y2 = Y(34:100)
• X3 = [X(1:33,:); X(67:100,:)] and Y3 = [Y(1:33); Y(67:100)]
Repeat the above procedure on these subsets and cross-validate the solution on the remainder of X and Y.
Specifically, learn the coefficients C using, say, X1 and Y1, and then measure norm(Y(67:100) − X(67:100,:)*C, 2). Which approach works best (averaging the results from the three subsets): Least Squares, or Ridge Regression (and for which value of s)?

Consider a linear equation W = M*S where M is a measurement matrix filled with random values {−1, 0, +1} (although now that they are there, they are no longer random), and W is the output of the sparse signal S when measured by M. Use Orthogonal Matching Pursuit (as described in the notes) to recover the non-zero entries of S. Record the order in which you find each entry and the residual vector after each step.

In this assignment you will explore different approaches to analyzing Markov chains.
• https://www.cs.utah.edu/~jeffp/teaching/cs5140/A6/M.dat
This data set is in matrix format and can be loaded into MATLAB or OCTAVE. Calling load filename (for instance load M.dat) will put the data in the file into memory, in this example as the matrix M. You can then display this matrix by typing M.
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory: https://www.cs.utah.edu/~jeffp/teaching/latex/

We will consider four ways to find q* = M^t q0 as t → ∞.
Matrix Power: Choose some large enough value t, and create M^t. Then apply q* = (M^t) q0. There are two ways to create M^t: first, we can just let M_{i+1} = M_i * M, repeating this process t − 1 times; alternatively (for simplicity assume t is a power of 2), in log2(t) steps create M_{2i} = M_i * M_i.
State Propagation: Iterate q_{i+1} = M * q_i for some large enough number of iterations t.
Random Walk: Start with a fixed state q0 = [0, 0, . . . , 1, . . .
, 0, 0]^T where there is only a 1 at the ith entry, and then transition to a new state with only a 1 at the i′th entry by choosing a new location proportional to the values in the ith column of M. Iterate this some large number t0 of steps to get state q′0. (This is the burn-in period.) Now take t new steps starting at q′0 and record the location after each step. Keep track of how many times you have recorded each location, and estimate q* as the normalized version (recall ||q*||_1 = 1) of the vector of these counts.
Eigen-Analysis: Compute eig(M) and take the first eigenvector after it has been normalized.
A (20 points): Run each method (with t = 512, q0 = [1, 0, 0, . . . , 0]^T, and t0 = 50 when needed) and report the answers.
B (10 points): Rerun the Matrix Power and State Propagation techniques with q0 = [0.1, 0.1, . . . , 0.1]^T. What value of t is required to get as close to the true answer as with the older initial state?
C (16 points): Explain at least one Pro and one Con of each approach. The Pro should explain a situation when it is the best option to use. The Con should explain why another approach may be better in some situation.
D (4 points): Is the Markov chain ergodic? Explain why or why not.

CS 6140 Data Mining; Spring 2014. Instructor: Jeff M. Phillips, University of Utah.

Repeat the trials in part 1.A above using taxation β = 0.85, so that at each step, with probability 1 − β, the state jumps to a random node. It is useful to see how the outcome changes with respect to the results from Question 1. Recall that this output is the PageRank vector of the graph represented by M. Briefly explain (in no more than 2 sentences) what you needed to do to alter the process in Question 1 to apply this taxation.
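State Propagation and Eigen-Analysis can be sketched together on a tiny made-up column-stochastic matrix (the assignment's M.dat follows the same convention: columns sum to 1).

```python
import numpy as np

# A hypothetical 2-state column-stochastic transition matrix (columns sum to 1).
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# State Propagation: iterate q_{i+1} = M q_i from q0 = [1, 0]^T for t = 512 steps.
q = np.array([1.0, 0.0])
for _ in range(512):
    q = M @ q
print(q)  # converges toward the stationary distribution q*

# Eigen-Analysis: take the eigenvector for eigenvalue 1 and normalize ||q*||_1 = 1.
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmax(np.real(vals))])
v = v / v.sum()
print(v)
```

For this M the stationary distribution works out to [2/3, 1/3], and both methods agree; convergence of propagation is governed by the second eigenvalue (here 0.7).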
This homework requires the student to design a simple application implemented in Java by applying the controller pattern, the iterator pattern, and the composite pattern. When the application starts, it displays a window containing two buttons and a drawing area. The buttons are labeled "Circle" and "Box," respectively. When one of the buttons is clicked and then the mouse is pressed in the drawing area, the corresponding shape is added to the collection of shapes of the composite. Moreover, all the shapes stored in the composite are repainted on the drawing area. Figure 1 shows a sample window with a circle and a box created.

Figure 1: Sample window with a circle and box created

2 What to Produce and Submit

The student is required to produce and submit the following:
1. Produce and submit a domain model class diagram (DMCD) for the simple application. The DMCD may contain only a few classes, but this is fine for this homework.
2. Produce and submit a design sequence diagram for the simple use case of the application. The design sequence diagram must apply the controller pattern, the iterator pattern, and the composite pattern. It must also clearly separate the presentation (i.e., the graphical user interface) and the business objects. That is, all business-object-related processing must take place in the business-object layer, not the presentation layer. Use a UML stereotype or UML note to show the patterns you apply.
3. Produce and submit a design class diagram for the simple application. Use a UML stereotype or UML note to show the patterns you apply.
4. Implement the design in Java. You may use Swing or AWT, whichever you prefer. Provide comments in your code to show the patterns you apply.
5. Compile and run your application. Add a circle, a box, another circle, and another box. Produce and submit the screen shot.
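The deliverable is a Java/Swing application, but the core idea of the composite and iterator patterns here (a drawing that holds shapes and repaints by iterating over them) can be sketched language-neutrally; the class and method names below are illustrative, not prescribed by the assignment.

```python
class Shape:
    def paint(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def paint(self):
        return f"circle at ({self.x}, {self.y})"

class Box(Shape):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def paint(self):
        return f"box at ({self.x}, {self.y})"

class Drawing(Shape):
    """Composite: holds child shapes and exposes an iterator over them."""
    def __init__(self):
        self.children = []
    def add(self, shape):
        self.children.append(shape)
    def __iter__(self):                      # the iterator pattern
        return iter(self.children)
    def paint(self):
        # Repaint every stored shape, as the drawing area does after each click.
        return [child.paint() for child in self]

drawing = Drawing()
drawing.add(Circle(10, 20))
drawing.add(Box(30, 40))
print(drawing.paint())  # ['circle at (10, 20)', 'box at (30, 40)']
```

Because Drawing is itself a Shape, a composite can nest inside another composite, which is the defining property of the pattern.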
3 How To Submit

To be announced by the TA before the deadline.

1 Introduction

This individual homework requires the student to modify the homework 1 design and implementation. The enhancement requirements are as follows:
1. The software shall have a Hello-World button in addition to the Box and Circle buttons. The behavior of this button is the same as the Box and Circle buttons, except that it draws a "Hello world" string in the drawing area. You decide the font and font size of the string.
2. The software shall support undo and redo. This requires you to add another two buttons to the button JPanel. You are required to apply the command pattern to support this functionality.
3. Apply the adapter pattern, if you had not done so in homework 1, to convert the various Java listener interfaces that you use to the interface of the controller. The Java listener interfaces may include an action listener's actionPerformed(…) method, or a mouse listener's mousePressed(…) method, etc. That is, each of the five buttons will have its own action listener, and the mouse-pressed event will also have its own mouse listener. These listeners adapt their respective listener interfaces to the respective methods of the controller. You may have done so for homework 1; if so, you just reuse your solution and apply it in homework 2.

2 What Needs to be Done and Submitted?

This homework requires the student to do the following:
1. Modify your domain model according to the new enhancement requirements.
2. Extend your design sequence diagram (DSD) to accommodate the new requirements and apply the command and adapter patterns. The patterns you applied in homework 1 must still be applied in homework 2. Use a UML stereotype or UML note to show the patterns applied.
3. Modify your design class diagram, which must be derived from the modified design sequence diagram and modified domain model. Use a UML stereotype or UML note to show the patterns applied.
4.
Modify your Java implementation to support the new requirements and the patterns applied. Provide comments in your code to show all of the patterns applied.
5. Compile and run your application. Arbitrarily add a few circles, boxes, and/or hello-world strings, and click the Undo and Redo buttons in between the drawing actions to test your software. Produce and submit screen shots to show your software working.

3 How To Submit

To be announced by the TA before the deadline.

Discuss the similarities and differences of the patterns in each of the following groups of patterns. Moreover, describe two situations that are not in the reference books or taught in class such that each pattern should be applied exclusively to exactly one situation. Describe how each pattern is applied and why the other pattern(s) must not be applied.
1. Command and object adapter
2. Flyweight, singleton, and prototype
3. Bridge and strategy

1.1 What To Submit

Submit answers in a Word or PDF document. The answers may include tables and figures if these help explain your solutions and facilitate understanding. However, tables and figures alone are not enough; you must explain your solutions to help the reader understand. The document must not exceed eight pages at a font size of at least 11 points. These eight pages include a cover page (which is optional) and other items such as figures, diagrams, and tables.

2 How To Submit

To be announced by the TA before the due date.

This individual homework requires the student to modify the homework 1 design and implementation to apply the state pattern. In homework 1, a variable was used to keep track of the state of the controller, and the software responded to the GUI events according to that state. This approach results in high complexity, especially when there are many states and many triggering events. This homework requires the student to apply the state pattern to reduce the complexity.

2 What Needs to be Done and Submitted?
This homework requires the student to do the following:
1. Produce a state machine to describe the state-dependent behavior of the controller.
2. Apply the state pattern to produce a design for the above state-dependent behavior. That is, apply the state pattern to convert the state machine into a UML class diagram.
3. Modify the homework 1 design class diagram to include the state pattern design produced above. Use a UML stereotype or UML note to show the state pattern.
4. Modify the homework 1 design sequence diagram (DSD) to include the application of the state pattern. Use a UML stereotype or UML note to show the state pattern.
5. Modify the homework 1 Java code to apply the state pattern. Provide in-code comments to show the state pattern.
6. Compile, run, and test your software. Produce and submit screen shots to show your software working.

3 How To Submit

To be announced by the TA before the deadline.
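The target implementation is Java, but the essence of the state pattern here (replacing an if/else on a state flag with state objects that the controller delegates to) can be sketched as follows; the state and event names are hypothetical, chosen to mirror the button/mouse events of this application.

```python
class State:
    """Base state: one method per GUI event; subclasses override the behavior."""
    def button_pressed(self, ctrl, label):
        raise NotImplementedError
    def mouse_pressed(self, ctrl, x, y):
        pass  # by default, a click in the drawing area does nothing

class Idle(State):
    def button_pressed(self, ctrl, label):
        ctrl.state = ShapeSelected(label)    # transition on a button click

class ShapeSelected(State):
    def __init__(self, label):
        self.label = label
    def button_pressed(self, ctrl, label):
        ctrl.state = ShapeSelected(label)    # re-selecting replaces the choice
    def mouse_pressed(self, ctrl, x, y):
        ctrl.shapes.append((self.label, x, y))  # add the shape at the click point

class Controller:
    """Delegates each event to the current state object instead of branching on a flag."""
    def __init__(self):
        self.state = Idle()
        self.shapes = []
    def on_button(self, label):
        self.state.button_pressed(self, label)
    def on_mouse(self, x, y):
        self.state.mouse_pressed(self, x, y)

ctrl = Controller()
ctrl.on_mouse(5, 5)          # ignored: no shape type selected yet
ctrl.on_button("Circle")
ctrl.on_mouse(10, 20)
print(ctrl.shapes)           # [('Circle', 10, 20)]
```

Each transition lives in exactly one state class, so adding a new state or event extends the design without growing a central conditional.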
For this homework assignment, you will be writing a "Hello World" program, but with some extra lines of output, asking for user input, and performing arithmetic operations on ints and doubles.

First, you must set up your Ubuntu environment (see class slides). Second, you must write a program, called "abc1234_Hello_World_1.cpp" (replace abc1234 with your netid), that does the following:
• Say "Hello User" to the user when the program starts
• Ask the user for their name
• Take their name as input
• Thank the user by name for telling you their name
• Ask for two numbers (ints)
• Show the Sum of these two numbers as an int
• Show the Difference of these two numbers as an int (Num1 − Num2)
• Show the Product of these two numbers as an int
• Show the Quotient of these two numbers as an int (Num1 / Num2)
• Ask for two more numbers (doubles)
• Show the Sum of these two numbers as a double
• Show the Difference of these two numbers as a double (Num1 − Num2)
• Show the Product of these two numbers as a double
• Show the Quotient of these two numbers as a double (Num1 / Num2)
• Thank the user by name for their time

Create a second file called "abc1234_Hello_World_2.cpp". This will do the exact same thing as above, but with one main difference: you will not ask for the user's name. Instead, you must figure out the user's name by some other means. It is acceptable to use a user name, proper name, or some other identifying name for that person. The name cannot be hard-coded in the source file. Test your program in two different user accounts (create a 2nd account on your Ubuntu copy with different profile information) and see if it works.

You will submit your code and screen shots of your code working via Blackboard.
You will upload a ZIP file, named “abc1234_HW1.zip”, containing 2 files (5 if you did the bonus):
• abc1234_Hello_World_1.cpp (source code)
• abc1234_Hello_World_1.png (screen shot of program running)
• abc1234_Hello_World_2.cpp (source code)
• abc1234_Hello_World_2a.png (screen shot of program running with first user)
• abc1234_Hello_World_2b.png (screen shot of program running with second user)

When you submit your file, provide the GTA with instructions on how to compile and run your code. Note: he will be using the terminal on the virtual machine that I used in class. Instructions can be included as a comment on Blackboard or in a README.txt file. Full-credit files named incorrectly result in a loss of 5 points each. Bonus files named incorrectly will result in a loss of 1 bonus point each. For information on taking screen shots and creating zip files in Ubuntu, see the class resources section in Blackboard.

For this homework assignment, you will be given a text document with a secret code. The text document contains a series of lowercase words, each on a separate line. Some words will have letters replaced with uppercase letters or numbers. The secret code is all the capital letters put together and the sum of the numbers.
• Open the file Words.txt to be read
• Store each word in a vector
• Output the list of words to the terminal
• Go through the vector, checking each character of each string
  o If an uppercase letter, add it to the code word
  o If a number, add it to a running sum of all numbers
• Output the secret code to the terminal
• Save the secret code to a file called abc1234_Secret_Code_1.txt
• Append the secret code to the end of the text document that came with the homework

The bonus for this homework will work the same as the full credit but with two exceptions. The first is that you will use the file Bonus_Words.txt.
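The full-credit character scan described above might be sketched as a small helper (a hedged sketch: file I/O is omitted, `words` stands in for the vector read from Words.txt, and appending the sum after the capitals is an assumption about the code's exact format; the bonus would additionally group adjacent digits into one number).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Build the secret code: collect capital letters and sum single digits.
std::string secret_code(const std::vector<std::string>& words) {
    std::string capitals;
    int sum = 0;
    for (const std::string& w : words)
        for (unsigned char c : w) {
            if (std::isupper(c)) capitals += c;   // collect capital letters
            if (std::isdigit(c)) sum += c - '0';  // add each digit to the sum
        }
    return capitals + std::to_string(sum);
}
```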
The second is that if you see two numbers next to each other, that is one two-digit number, not 2 separate numbers.

You will submit your code and screen shots of your code working via Blackboard. You will upload a ZIP file, named “abc1234_HW1.zip”, containing 4 files (8 if you did the bonus):
• abc1234_Secret_Code_1.cpp (source code)
• Words.txt
• abc1234_Secret_Code_1.png (screen shot of program running)
• abc1234_Secret_Code_1.txt
• abc1234_Secret_Code_2.cpp (source code)
• Bonus_Words.txt
• abc1234_Secret_Code_2.png (screen shot of program running)
• abc1234_Secret_Code_2.txt

When you submit your file, provide the GTA with instructions on how to compile and run your code. Note: he will be using the terminal on the virtual machine that I used in class. Instructions can be included as a comment on Blackboard or in a README.txt file. Full-credit files named incorrectly result in a loss of 5 points each. Bonus files named incorrectly will result in a loss of 1 bonus point each.

In this homework, you will create a class that stores a list of transactions and computes the average transaction for a given month. The user must input the name of the month and the transactions into the terminal.

The Transaction_List Class
First you will create two C++ files called abc1234_Transaction_List.h and abc1234_Transaction_List.cpp. Below is a UML class diagram that shows the basic design of the Transaction_List class. This class contains a list of transactions for a certain month. Each transaction is represented as a double in the vector transactions. The constructor sets the month equal to m and num_transactions to 0. get_name returns the name of the month. add_transaction adds a new transaction to the list of transactions and adds 1 to num_transactions. average_transaction returns the average transaction for that month. If there are no transactions, then this should return 0.
Testing the Transaction_List
Next you will test your Transaction_List class by creating a new C++ file called abc1234_test.cpp, which consists of just a main method. The main function must be able to do the following three tests.

Test #1
o Create a Transaction_List object named “May” and add 4 transactions: 150, 225.3, 78.9, and 523.99
o Get the Transaction_List’s name. If the month is not “May”, display a useful error message to the terminal (cout).
o Compute the average transaction for the month. If the average is not 244.5475, display a useful error message to the terminal (cout).

Test #2
o Create another Transaction_List called “March” and add 4 transactions: 150, 225.3, 78.9, and 523.99
o Get the Transaction_List’s name. If the month is not “May”, display a useful error message to the terminal (cout).
o Compute the average transaction for the month. If the average is not 244.5475, display a useful error message to the terminal (cout).

Test #3
o Create a Transaction_List object named “May” and add 4 transactions: 150, 225.8, 78.9, and 523.99
o Get the Transaction_List’s name. If the month is not “May”, display a useful error message to the terminal (cout).
o Compute the average transaction for the month. If the average is not 244.5475, display a useful error message to the terminal (cout).

Take a screenshot of your terminal that shows the output from all 3 tests. Name this screenshot abc1234_test.png. If you have multiple screenshots, name them abc1234_test_1.png, abc1234_test_2.png, etc.

The main program
Third, write one more C++ file called abc1234_main.cpp. This main program consists of a main function that should do the following (you may have to write more loops than what is listed below, or more functions than just main):
• Create a list (vector) of Transaction_Lists
• Prompt the user for the name of a month and take it in as input.
• Create a Transaction_List object with the name you just received as input.
• Create a loop that asks for a transaction and takes it in as input.
  o If the input is -999, exit the loop.
  o If the input is not -999, add it to the list of transactions.
• When the loop is terminated, add this month to the list of Transaction_Lists.
• Ask the user if there is another month he/she wants to put in (Y/N).
  o If yes, go back to the second bullet point.
  o If no, print out the name of each Transaction_List (the month name) and the average, each on a different line.

Take a screenshot of your code running in terminal. Name this screenshot abc1234_main.png. If you have multiple screenshots, name them abc1234_main_1.png, abc1234_main_2.png, etc. When grading this, you will not know how many months the GTA will put in, so make sure your code can handle a different number of months.

Makefile
You are REQUIRED to provide a simple makefile that can build and execute your test code given the command “make test”, and build and execute your main given the command “make”. The sample course code from lecture 5 is a good example of a makefile to use for this homework. You will put all relevant files for the full credit portion of the homework in a folder called full_credit.

Bonus 1
Copy your abc1234_Transaction_List.h, abc1234_Transaction_List.cpp, and abc1234_main.cpp into a separate folder called bonus_1. Change your Transaction_List class so the double num_transactions is not mentioned in your code (delete all references to it). Find some way to still compute the average over the number of transactions submitted. Once that is complete, make a new makefile for this bonus. Run your main function and take screen shots of your code, using the same naming conventions as mentioned above. Place those screen shots in the bonus_1 folder.

Bonus 2 (15 pts)
For this, you will be modifying the provided UML diagram in Umbrello.
The provided UML class diagram is called Transaction_List.xmi. You are to do the following to the UML diagram (double-clicking the class diagram opens the properties window):
• Include a private vector called person. This list will correspond to the list of transactions and will represent who rang up each transaction.
• Modify the add_transaction() function to take in two arguments, one called high and one called p. The add_transaction function will now add the input to the list of transactions and the name p into the list of person.
• Next, add a method called bonus. This function will return the name of the person who rang up the most transactions in that month. Since that person rang up the most transactions, that person will get a bonus.

You will save this UML diagram to a new folder called bonus_2. Copy your abc1234_Transaction_List.h, abc1234_Transaction_List.cpp, and abc1234_main.cpp files from your full credit directory into bonus_2. Next you will modify your Transaction_List class to reflect any changes made to the UML diagram.

Lastly, modify your abc1234_main.cpp program to allow the user to input both the transactions and the names. If the transaction is still -999, quit the loop. When done taking input, output the name of the month, the average transaction, and who got the bonus, each on a different line. Take screen shots of your code working, and place those screen shots in the bonus_2 folder. Follow the same naming conventions as above for the screenshots.

Deliverables
You will submit your code and screenshots via Blackboard.
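Before packaging things up, the bonus() counting logic from Bonus 2 above might be sketched as follows (a hedged sketch assuming the person vector holds one name per transaction; tie-breaking is not specified in the handout, so ties here go to the alphabetically first name, since std::map iterates keys in sorted order).

```cpp
#include <map>
#include <string>
#include <vector>

// Return the name that appears most often in the person list.
std::string bonus(const std::vector<std::string>& person) {
    std::map<std::string, int> counts;
    for (const std::string& p : person) counts[p]++;  // tally per name
    std::string best;
    int best_count = 0;
    for (const auto& entry : counts)
        if (entry.second > best_count) {
            best = entry.first;
            best_count = entry.second;
        }
    return best;  // empty string if no transactions were entered
}
```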
You will upload a zip file, named “abc1234_HW3.zip”, containing 1 folder (3 if you did all the bonuses):
full_credit folder
o abc1234_Transaction_List.h
o abc1234_Transaction_List.cpp
o abc1234_test.cpp
o abc1234_main.cpp
o abc1234_test.png (or multiple if multiple screenshots were taken)
o abc1234_main.png (or multiple if multiple screenshots were taken)
o makefile
o Instructions for compiling and running (either in comments via Blackboard or in a README file)
bonus_1 folder
o abc1234_Transaction_List.h
o abc1234_Transaction_List.cpp
o abc1234_main.cpp
o abc1234_main.png (or multiple if multiple screenshots were taken)
o makefile
o Instructions for compiling and running (either in comments via Blackboard or in a README file)
bonus_2 folder
o abc1234_Transaction_List.xmi (the UML file)
o abc1234_Transaction_List.h
o abc1234_Transaction_List.cpp
o abc1234_main.cpp
o abc1234_main.png (or multiple if multiple screenshots were taken)
o makefile
o Instructions for compiling and running (either in comments via Blackboard or in a README file)

Full credit files named incorrectly result in a loss of 5 points each.

In this homework, you will be making a different version of your Homework #3. You will be using a map to store Transaction objects (values) sorted by Date objects (keys). You will also be making a menu to allow a user to select different options and input data.

You are to implement 3 classes: Date, Transaction, and Transaction_List. Date represents a date, down to the second. It has 6 private variables: year representing the year, month representing the month, day representing the day, hour representing the hour, minute representing the minute, and second representing the second. There are also 4 public functions. The constructor takes in values for the private variables and assigns them. to_string() converts the data of the class to a string. operator
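A minimal sketch of the Date class as described above. The “Y-M-D H:M:S” string format is an assumption (the handout only says to_string() converts the data to a string), and an operator< is included because std::map keys need a strict ordering.

```cpp
#include <string>
#include <tuple>

class Date {
    int year, month, day, hour, minute, second;  // the 6 private variables
public:
    // Constructor takes in values for the private variables and assigns them.
    Date(int y, int mo, int d, int h, int mi, int s)
        : year(y), month(mo), day(d), hour(h), minute(mi), second(s) {}

    // Assumed format: "Y-M-D H:M:S" (check your handout/UML for the real one).
    std::string to_string() const {
        return std::to_string(year) + "-" + std::to_string(month) + "-" +
               std::to_string(day) + " " + std::to_string(hour) + ":" +
               std::to_string(minute) + ":" + std::to_string(second);
    }

    // Chronological ordering, so Date can serve as a std::map key.
    bool operator<(const Date& o) const {
        return std::tie(year, month, day, hour, minute, second) <
               std::tie(o.year, o.month, o.day, o.hour, o.minute, o.second);
    }
};
```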