Cyber-Physical Systems Security (Summer 2025)

The first step towards understanding the concepts of cyber-physical systems (CPS) security is to get familiar with their control logic and mechanisms. In the first mini project, we will learn how to model an industrial control system (ICS) with its control logic. To do so, an advanced educational modeling software, namely Factory IO coupled with Control IO, will be used to model the process and design the underlying controller. The latest trial version of Factory IO for Windows machines can be downloaded from here: Factory IO trial version (Alternate Link). You can use this version for 30 days.

Factory IO requires the Windows operating system (OS). If you do not have access to a Windows machine, you can download Windows from Georgia Tech Azure Dev Tools (formerly Microsoft Imagine). If you have a Mac, then Parallels Desktop is a great way to set up the Windows virtual environment. The trial version of Parallels is sufficient for this mini project, but you can also get a student discount for the full version. Otherwise, any virtual machine manager is allowed (e.g., VirtualBox). If you use VirtualBox, then you may need to install Guest Additions so that the GUI resolution auto-adjusts. Please note that Factory IO needs a GPU for better graphical rendering performance and a good amount of RAM. Therefore, make sure you have allocated enough hardware resources from your host machine to get a smooth environment. If you are still unable to get a smooth running experience, try decreasing the graphical rendering quality within the Factory IO settings. (Please avoid using VirtualBox if possible; it is not ideal for development.)

Factory IO requires heavy compute resources to run without any inexplicable problems and anomalies. A machine with the following specifications would be ideal for this project. These are not hard requirements, but they will help improve your development experience. The listed specifications are based on statistics from previous classes:

Inside Factory IO, there is a visual environment where you can create the 3-D model of an industrial control process. Also, you can design a controller for the process, where the actuators are controlled by sensors, input buttons, and the underlying control logic. The control logic of the process can be embedded inside a programmable logic controller (PLC) connected to your computer. We will learn how to program a PLC in the next mini project. Instead of connecting a real PLC, in this mini project, you will use the Control IO interface with Factory IO to design your controller. You can connect the block diagrams of the inputs, sensors, and actuators with different logic to form your controller. This is very similar to programming in the Simulink/MATLAB environment. Factory IO can be very easy to learn, but for this project you are only working in Control IO.

It does take some time to get familiar and comfortable with the Factory IO and Control IO software. This project requires Function Block Diagram design, which is a bit different from traditional software programming (there are no IF statements!). Please do not wait until the last weekend to start this project, as the estimated time of completion could be anywhere from 15 to 50 hours. Below are helpful resources to get started on Factory IO, Control IO, PLCs, and Function Block Programming.

Table 1: Helpful Resources

To better assist with this assignment, a series of YouTube videos has been created to highlight the requirements for each part.
Please note that the exact description and implementation in the videos may differ from the project requirements. It is your responsibility to follow the requirements specified in this document. Regardless, these videos should provide a high-level understanding of what a working solution looks like.

Table 2: YouTube Videos

In order to get started and view the Part 1 scene, unzip the attached project zip. Double-click on the scene file under the scenes directory to open it (in this case part1.factoryio). In turn, this will start Factory IO with the scene loaded.

If Factory IO is opened first, a blank Control IO file might open. If this happens, you should not use this Control IO; instead, you will need to open the Control IO file associated with your work. The blank Control IO file will have a file name such as "Diagram1.controlio". That is your indication to load your Control IO file. Make sure that the Control IO file you are editing is the correct file. Do not modify the 3D environment in Factory IO, as the same originally provided scene will be loaded during grading, not the one that you submit. The provided .controlio files contain a block used for a programmatic connection to Canvas for grading, and it is labeled as such. Do not remove or adjust it!

A logic bomb is a set of instructions incorporated into a program so that if some condition is met, the instructions will be executed. Typically, triggering a logic bomb causes a harmful change to the normal operation of the program. For the purpose of this assignment, the logic bombs will result in a change to the normal operation, but the change may or may not appear to be harmful to you. However, the plant owner may feel that the unauthorized change is harmful or might result in an unobserved downstream effect that she considers harmful.

In this part, the goal is to design and implement a water tank system that fills and drains automatically. In the pre-built Factory IO scene, there are three water tanks. However, this part only requires working with one water tank (not all three). Below are the steps required for the implementation:

This implementation has already been done and the solution is provided in the assignment zip file. Also, there is a YouTube video that discusses how to implement the solution. This is provided as a reference and to assist with the learning curve required for working with Factory & Control IO. Note: The tank level targets in the YouTube video may differ from the requirements of this write-up. The levels specified in this write-up must be used for your project.

This part is an extension of the first and will require using all three water tanks and adding a logic bomb. You may use your part1a.controlio file as a starting point for Part 1b. You should close both Factory IO and Control IO while you copy and rename the file to part1b.controlio and then re-open the program to continue programming.

For Tank 1, once the water level reaches 55% of the tank capacity, the fill valve must close and the drain valve must open and continue until the water level reaches 35% of the tank capacity. At this point, the drain valve must close and the fill valve must open again to refill to 55%. This process must continue working in a repeated cycle until the stop button is pressed.
By pressing the stop button, the remaining water in the tank must begin draining and then the entire process must stop once empty.For Tank 2, once the water level reaches 70% of the tank capacity, the fill valve must close and the drain valve must open and continue until the water level reaches 40% of the tank capacity. This process must continue working in a repeated cycle until the stop button is pressed. By pressing the stop button, the remaining water in the tank must begin draining and then the entire process must stop once empty.For Tank 3, once the water level reaches 90% of the tank capacity, the fill valve must close and the drain valve must open and continue until the water level reaches 55% of the tank capacity. This process must continue working in a repeated cycle until the stop button is pressed. By pressing the stop button, the remaining water in the tank must begin draining and then the entire process must stop once empty.After Tank 2 fills to 70% three times, the fill and drain logic of Tank 1 and 3 should be affected. As a result, your logic bomb should cause the Tank 1 tank to cycle between 14% and 16%. The cycle action is to drain to 14% and when 14% is reached, it fills to 16% and then drains to 14% and continues the 14-16% cycle. Tank 3 should fill to 100% capacity. If the stop button (or start button) is pressed after the logic bomb has triggered, Tank 2 should continue to respond accordingly, however, Tank 1 and 3 should NOT (i.e. ignore the start/stop buttons). Under no conditions shall the count for the logic bomb trigger be reset.Note: The tank level thresholds described here are nominal levels. As a tank fills and empties, its actual level at any point in the cycle may go just above a threshold, drop below, and then go back above due to tank splashing from the physics of the filling and emptying process. There should be a noticeable and intentional drop in tank level before a high level is considered to be reached for a subsequent cycle. In other words, if a high level threshold is 70%, dropping to 69% and then splashing back up to 70% or higher is not considered to be reaching 70% a second time. There must be an intentional draining of the tank, noticeable drop in level, and refilling of the tank before 70% can be triggered as a high level again.A pre-built Factory IO scene is provided and you are not allowed to modify the pre-built scene, as this will be the scene used for grading. Any points lost due to scene modifications are not eligible for a regrade (including any sensors/actuators that are already FORCED). This part will be graded under normal speed (1x). If you speed up the scene during programming and testing, ensure it still works at 1x speed! There are no requirements for connecting the different lights and control box displays for this part.In this section, the goal is to design and implement a pallet sorting and storage system. Below are the steps required for implementation:Table 3: Part 2 Control Box ConfigurationFor this part, the goal is to introduce a logic bomb into the implementation. You may use your part2a.controlio file as a starting point for Part 2b. You should close both Factory IO and Control IO while you copy and rename the file to part2b.controlio and then re-open the program to continue programming.The logic bomb should trigger once 12 pallets of any size box have been stored in the rack. In other words, 3 large / 4 medium / 5 small boxes have been stored, or 2 large / 8 medium / 2 small, etc. 
The logic bomb should cause the Stacker Crane to begin placing pallets where they have already been stored, in effect knocking some pallets off the rack. There should be no changes to the control box inventory display, so it should be reporting incorrect numbers since some of the pallets have been pushed out of inventory. Also, the system shall stop responding to the start and stop buttons after the logic bomb has triggered. Under no conditions shall the count for the logic bomb trigger be reset, nor shall the logic bomb effect be cancelled once in place.

A pre-built Factory IO scene is provided, and you are not allowed to modify the pre-built scene, as this will be the scene used for grading. Any points lost due to scene modifications are not eligible for a regrade (including any sensors/actuators that are already FORCED). This part will be graded under normal speed (1x). If you speed up the scene during programming and testing, ensure it still works at 1x speed!

Create a zip file named FirstName-LastName-mp1.zip (e.g., Tohid-Shekari-mp1.zip) that includes all your files and submit it on Canvas. Your zip file should be generated in such a way that when it is extracted, we get two folders with the names scenes and controllers. If you make multiple submissions to Canvas, a number will be appended to your filename. This is inserted by Canvas and will not result in any penalty. The Part 1a Control IO file was provided as an example and should not be turned in.

Note: Failure to follow the submission and naming instructions will result in a 20% point loss.

A good way to ensure the scenes being submitted are no different from the original scenes is to compare the original project file hash and your submission file hash. In Windows PowerShell, run Get-FileHash on the original project file and compare the result to Get-FileHash run on your submission file.

Please check your submission multiple times, as any submissions after the deadline will be subject to the late submission penalty per the syllabus. After making your final submission to Canvas, you should download your submission from Canvas and unzip it to ensure all files are there and in the correct locations. There will be no exceptions to the late policy for any reason (for example, wrong file submitted, forgot to include a file, etc.).

The tables below are a rough estimation of how the TAs will approach grading. It is up to you to read through the requirements of the project and ensure all requirements are satisfied.

Table 4: Part 1 Water Tank Logic Bomb (25 Points)
Table 5: Part 1 Water Tank Logic Bomb (10 Points)
Table 6: Part 2 Pallet Storage (50 Points)
Table 7: Part 2 Pallet Storage Logic Bomb (15 Points)

Extra Credit: To further encourage efficient programming diagrams, bonus points on this project will be granted to the five students with the overall lowest block counts. Similar to the penalty thresholds, Notes, Memory Booleans, and Memory Integers will be excluded from this count. The student with the lowest block count will receive 5 additional points on this project. The second lowest will receive 4, third lowest 3, fourth lowest 2, and fifth lowest 1.

In order to receive these points, your project must score perfectly in all sections and you must implement the control box button lights and rotating beacon warning lights where they exist in the scenes.
In Part 2, the control box button lights should illuminate to indicate the running status of the system (green if running, red if stopped).

You may use the following Python code snippet (execute in a terminal/IDE) for enumerating the number of blocks in your code:

Table 8: Box Sizes
Table 9: Box Size Storage Locations
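Although the controller itself must be built from function blocks in Control IO rather than written as code (there are no IF statements), it can help to sketch the intended behavior before wiring blocks. Below is a minimal Python model of the Part 1 fill/drain hysteresis for a single tank; the class name, start/stop handling, and update interface are illustrative only (the 55%/35% thresholds are the Tank 1 values from the write-up), and this is not the required Control IO design.

class TankModel:
    """Conceptual model only: fill to the high mark, drain to the low mark,
    and drain completely after a stop request (Part 1 behavior)."""

    def __init__(self, high=55.0, low=35.0):
        self.high = high          # % level at which filling stops
        self.low = low            # % level at which draining stops
        self.running = False
        self.draining = False     # which half of the cycle we are in
        self.stopping = False

    def start(self):
        self.running = True
        self.stopping = False
        self.draining = False

    def stop(self):
        # Stop request: finish by draining the tank completely.
        self.stopping = True

    def update(self, level):
        """Given the current level reading (0-100 %), return the pair
        (fill_valve_open, drain_valve_open)."""
        if not self.running:
            return False, False
        if self.stopping:
            if level <= 0.0:
                self.running = False        # tank is empty: everything stops
                return False, False
            return False, True              # keep draining whatever is left
        # Hysteresis: latch between the high and low set points.
        if not self.draining and level >= self.high:
            self.draining = True            # high mark reached: switch to draining
        elif self.draining and level <= self.low:
            self.draining = False           # low mark reached: switch back to filling
        return (not self.draining, self.draining)

# Example: drive the model with a simulated level reading.
ctrl = TankModel(high=55.0, low=35.0)
ctrl.start()
fill_open, drain_open = ctrl.update(level=40.0)   # still filling toward 55%

One common way to realize this kind of hysteresis in a function block diagram is a pair of comparisons driving a set/reset memory element that selects between the fill and drain valves.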
Distance Vector

Table of Contents
PROJECT GOAL
Part 0: Getting Started
Part 1: Files Layout
Part 2: TODOs
Part 3: Testing and Debugging
Part 4: Assumptions and Clarifications
Part 5: Correct Logs for Provided Topologies
Part 6: Spirit of the Project
Part 7: FAQs
What to Turn In
What you can and cannot share
Rubric

In the lectures, you learned about Distance Vector (DV) routing protocols, one of the two classes of routing protocols. DV protocols, such as RIP, use a fully distributed algorithm to find shortest paths by solving the Bellman-Ford equation at each node. In this project, you will develop a distributed Bellman-Ford algorithm and use it to calculate routing paths in a network. This project is similar to the Spanning Tree project, except that we are solving a routing problem, not a switching problem.

In "pure" distance vector routing protocols, the hop count (the number of links to be traversed) determines the distance between nodes. Some distance vector routing protocols that operate at higher levels (like BGP) must make routing decisions based on business valuations. These protocols are sometimes referred to as Path Vector protocols. We will explore this by using weighted links (including negatively weighted links) in our network topologies.

We can think of Nodes in this simulation as individual Autonomous Systems (ASes), and the weights on the links as a reflection of the business relationships between ASes. Links are directed, originating at one Node and terminating at another.

You should review some materials on Bellman-Ford. Some resources include:

Download and unzip the Project Files for Distance Vector from Canvas in the Assignments section. This project can be completed in the class VM or on your local machine using Python 3.10.x. You must be sure that your submission runs properly in Gradescope.

The DistanceVector directory contains the following files:

There are a few TODOs in DistanceVector.py:

To run your algorithm on a specific topology, execute the run.sh bash script:

./run.sh *Topo

Substitute the correct, desired filename for *Topo. Don't use the .txt suffix on the command line. This will execute your implementation of the algorithm in DistanceVector.py on the topology defined in *Topo.txt and log the results (per your logging function) to *Topo.log.

NOTE: You should not include the full filename of the topology when executing the run.sh script. For example, to run the algorithm on topo1.txt you should only specify topo1 as the argument to run.sh.

For this project, you may create as many topologies as you wish and share them on Ed Discussion. We encourage sharing new topologies with log outputs. Topologies with format errors will get an error back when you try to run them.

We've included four good topologies for you to use in testing and one bad topology to demonstrate an invalid topology. The provided topologies do not cover all the edge cases; your code will be graded against more complex topologies.

"advertise" other nodes it can reach (Nodes C and D).
(partitioned networks) o topologies that do not require intermediate steps (such as a topology with a single node)Below are the correct final logs for the provided topologies. We are providing them to help you identify correct behavior with respect to negative cycles and the assumptions in the instructions. We are only providing the final round; each topology should produce at least 2 rounds of output. SimpleTopo:A:(A,0) (B,1) (C,3) (D,3) B:(B,0) (A,1) (C,2) (D,2) C:(C,0) (B,2) (A,3) (D,0) D:(D,0) (C,0) (B,2) (A,3) E:(E,0) (D,-1) (C,-1) (B,1) (A,2) SingleLoopTopo:A:(A,0) (D,5) (E,6) (B,6) (C,16) B:(B,0) (A,2) (D,7) (C,10) (E,0) C:(C,0) D:(D,0) (E,1) (B,1) (A,3) (C,11) E:(E,0) (B,0) (A,2) (D,7) (C,10) SimpleNegativeCycle: ComplexTopo:ATT:(ATT,0) (CMCT,-99) (TWC,-99) (GSAT,-8) (UGA,-99) (VONA,-11) (VZ,-3) CMCT:(CMCT,0) (TWC,-99) (ATT,1) (VONA,-10) (GSAT,-7) (UGA,-99) (VZ,-2) DRPA:(DRPA,0) (EGLN,1) (GT,-1) (UC,-1) (CMCT,-99) (TWC,-99) (ATT,13) (OSU,-1) (VONA,2) (GSAT,5) (UGA,-99) (PTGN,1) (VZ,10) EGLN:(EGLN,0) (GT,-2) (UC,-2) (DRPA,1) (CMCT,-99) (OSU,-2) (TWC,-99) (ATT,13) (PTGN,0) (VONA,3) (GSAT,5) (UGA,-99) (VZ,11) GSAT:(GSAT,0) (VONA,-3) (VZ,5) (UGA,-99) (ATT,7) (CMCT,-99) (TWC,-99) GT:(GT,0) (UC,0) (EGLN,2) (OSU,0) (DRPA,3) (PTGN,2) (CMCT,-99) (VONA,5) (TWC,-99) (ATT,15) (VZ,13) (GSAT,7) (UGA,-99) OSU:(OSU,0) (UC,0) (GT,0) (EGLN,2) (PTGN,2) (VONA,5) (DRPA,3) (VZ,13) (GSAT,7) (CMCT,-99) (ATT,15) (UGA,-99) (TWC,-99) PTGN:(PTGN,0) (OSU,-1) (UC,-1) (GT,-1) (EGLN,1) (VONA,3) (VZ,11) (GSAT,5) (DRPA,2) (ATT,13) (UGA,-99) (CMCT,-99) (TWC,-99) TWC:(TWC,0) (CMCT,-99) (ATT,1) (VONA,-10) (VZ,-2) (GSAT,-7) (UGA,-99) UC:(UC,0) (GT,0) (EGLN,2) (OSU,0) (PTGN,2) (DRPA,3) (VONA,5) (CMCT,-99) (VZ,13) (GSAT,7) (TWC,-99) (ATT,15) (UGA,-99) UGA:(UGA,0) (ATT,50) (CMCT,-99) (TWC,-99) (GSAT,42) (VONA,39) (VZ,47) VONA:(VONA,0) (VZ,8) (GSAT,2) (ATT,10) (UGA,-99) (CMCT,-99) (TWC,-99) VZ:(VZ,0) (ATT,2) (CMCT,-99) (TWC,-99) (GSAT,-6) (UGA,-99) (VONA,-9) The goal of this project is to implement a simplified version of a network protocol using a distributed algorithm. This means that your algorithm should be implemented at the network node level. Each network node only knows its internal state, and the information passed to it by its direct neighbors. Declaring global variables will be a violation of the spirit of the project.The skeleton code we provide you runs a simulation of the larger network topology. For simplicity, the Node class defines a link to the overall topology. This means it is possible using the provided code for one Node to access another Node’s internal state. This goes against the spirit of the project and is not permitted. If you have questions about whether your code is accessing data it should not, please ask on Ed Discussion or during office hours!You should not use any global variables for managing any data relating to the Nodes. However, you may use a global variable as a setting. I.E.: NEGATIVE_INFINITY = -99Q: May I import a python module into DistanceVector.py? For example, may I use import collections, typing, etc. A: Your solution should not require any outside Python modules. Please do not import any other modules.Q: What is the best way to format and process node messages? A: There is no right or wrong way to format messages. For best results keep things simple.Q: Is it required that the distance vectors displayed in my log files be alphabetized? A: Look at the finish_round function in Toology.py. 
Note how the DVs are alphabetized each round, and this is reflected in the provided correct output logs. The nodes within individual vectors are not required to be sorted.

Q: Should my solution include an implementation of split horizon?
A: That is not a requirement for this project.

Q: What if there really is a valid path between two indirectly linked nodes with no cycle and the total cost is -99 or less?

To complete this project, submit ONLY your DistanceVector.py file to Gradescope as a single file. Do not modify the name of DistanceVector. You can make an unlimited number of submissions to Gradescope. Your last submission will be your grade unless you activate a different submission. There are some very important guidelines for this file you must follow:

Do not share the content of your DistanceVector.py file with your fellow students, on Ed Discussion, or elsewhere publicly. You may share any log files for any topology, and you may also share new topologies. Additionally, code that you write that is not required for turn-in, such as testing suites, may be shared. It may be a good idea to share a "correct" log for a particular topology, if you have one, when you share the code for that topology.

When sharing log files, leave alphabetization on so that your classmates can use the diff tool to see if you are getting the same log outputs as they are.

All work must be your own, and consulting Distance Vector Routing solutions, even in another programming language or just for reference, is considered a violation of the honor code. Do not reference solutions on GitHub! Do not use IDE extensions (like GitHub Copilot) that write or recommend blocks of code to you (autocomplete for function names is fine). For more information see the Syllabus Definition of Plagiarism. We have worked hard to provide you with all the material you need to complete this project without help from Google/Stack Overflow (searching basic Python syntax is fine). Don't risk an honor code violation for the project.

GRADING NOTE: There is no partial credit for individual topologies; each topology is either "passed" or "failed".

As with previous projects in this course, due to the size of the class, we will not accept resubmissions, modifications to old submissions past the deadline, etc.
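For reference, the Bellman-Ford equation mentioned in the project overview above is, in its standard textbook form (written here in LaTeX; this is generic background, not project-specific code):

    D_x(y) = \min_{v \in \mathrm{neighbors}(x)} \bigl( c(x, v) + D_v(y) \bigr)

where c(x, v) is the weight of the directed link from node x to its neighbor v, and D_v(y) is the distance to destination y most recently advertised by v. In a DV protocol, each node re-evaluates this minimum whenever a neighbor advertises an updated vector, and the exchange repeats until no entries change.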
To complete the assignment, you must complete the following tasks:

https://github.gatech.edu/gt-omscs-se-2025summer/6300Summer25.git

(If you do not know how to extract a "tar.gz" archive, you should be able to find plenty of resources on how to do that online; ask on Ed Discussion otherwise.)

○ Directory /test contains, in a suitable directory, a template JUnit test class edu.gatech.seclass.MyStringTest.
○ File /lib/junit-platform-console-standalone-1.9.1.jar, a JUnit library to be used for the assignment
■ Placeholder instruction
○ Required from you:
■ Write a meaningful test case for the method, as identified by the placeholder instruction.
■ Replace the text "" in the test comment with a concise description of the purpose of the test (e.g., "Count words in an empty string") without modifying anything else in the comment and making sure that your comment consists of a single line (i.e., does not contain newlines) and does not contain quotes or other special characters.
■ Make sure that every test method has a suitable oracle (i.e., either an assertion or an expected exception) and that the tests are not trivial (i.e., are not a copy of the provided one and have a specific purpose). In other words, each test should (1) test a specific piece of the functionality, and (2) check that such a piece of functionality behaves as expected.
■ It should be obvious, but please also make sure that all the test cases you created pass when run against your code.
■ In addition, at least two of the tests that you develop must result in an expected exception (e.g., NullPointerException). When testing for an expected exception, make sure to assert as follows:

assertThrows(<expected exception>.class, () -> mystring.<method under test>(<args>))

○ Consider, for example, test case testCountWordsS2(): In your submission, you should have a corresponding, complete test for method countWords in class MyString, together with a concise description of the purpose of the method, such as:

@Test
@Timeout(value = 5000, unit = TimeUnit.MILLISECONDS)
// Description: Count words in an empty string
public void testCountWordsS2() {
}

To submit your solution, you should:

gpburdell1
81b2f59

As soon as you submit, your assignment will be graded by compiling your code and running it against both your test cases and a set of test cases written by the instructors. All tests must pass.[1] After that, which should take a few minutes at most, you will see a report with your grade and some corresponding feedback. The feedback should be self-explanatory (e.g., "ERROR: You have 17 tests instead of 16."), but please let us know otherwise.

The grade you will see on Gradescope is your actual grade for the assignment, unless we find some issues with your submission (e.g., hardcoding of the results, identical tests, or similar).

You can resubmit as many times as you want before the deadline, so you have a chance to address any issues Gradescope finds with your submission.

○ Assignment3/test/edu/gatech/seclass/MyStringTest.java
○ Assignment3/lib/junit-platform-console-standalone-1.9.1.jar

Feel free to commit extra files (e.g., IDEA's project-related files), as this does not make a difference for your grade.

○ The already provided test cases, except for those that you are supposed to implement, obviously, whose body is simply fail("Not yet implemented").
○ The test class name, the names of the test cases, and the test comments, except for replacing "" with a concise description of the purpose of the test.
We understand that it may be advisable to use more meaningful names for the tests, but having numbered tests helps the grading.

○ The declaration of mystring in the test class.

○ Go to directory Assignment3 in this fresh clone
○ Compile your code. One way to do so is to run, from a Unix-like shell:

javac -cp lib/* -d classes src/edu/gatech/seclass/* test/edu/gatech/seclass/MyStringTest.java

(on some platforms, you may need to first create directory "classes")

○ Run your tests. Again, from a Unix-like shell, you can run:

java -cp classes:lib/* org.junit.platform.console.ConsoleLauncher --select-class edu.gatech.seclass.MyStringTest [3]

Gradescope Feedback: The test case results that you see in Gradescope tell you whether a given test passed or not. If the test didn't pass, Gradescope should show the difference between the expected and actual output. For ease of consumption, that difference may contain ellipses ("…") to make the feedback more compact by omitting common parts of expected and actual output, and use square brackets to map corresponding parts of the output that differ in the expected and actual output.

Requests for clarifications: If you need clarifications on a specific test or Gradescope output, please post privately on Ed Discussion (we will make it public if appropriate) and make sure to add, when it applies:

The bottom line is that, to make the interaction efficient, you should make your posts as self-contained and easy to check as possible. The faster we can respond to the posts, the more students we can help.

Although it is not mandatory, we recommend that you use IntelliJ IDEA to complete the assignment, so that you can also get familiar with this IDE (if you aren't already). To do so, you should open IDEA and do the following (notice the initial, alternative steps):

○ Select as the directory to import
○ Select as the project directory
○ Select "Create project from existing sources"
○ Click "Next" accepting the default choices (make sure to include the provided libraries and to select Java SDK 17) and then "Finish"
○ If SDK is not installed, choose to download it

[1] Our test cases make sure that you implemented the functionality of the required methods correctly. They are fairly simple and are not trying to exercise arcane corner cases.
[2] Internal Java libraries are those documented in the Java Platform Standard Edition API Specification.
[3] If using a Windows-based system, you may need to instead run: java -cp "classes;lib/*" org.junit.platform.console.ConsoleLauncher --select-class edu.gatech.seclass.MyStringTest
The following is based on true events from 2014 which occurred in a small start-up company in Monroe, CT.

A new small to medium sized eCommerce start-up based in Monroe, CT has recently begun to notice anomalies in their financial records. They have also recently received a number of customer complaints saying that invoices received from the company often bill more services than the customers asked for, and they are often directed to send check payments to a PO Box that does not look legitimate.

Upon reviewing their most recent set of invoices (saved as plain text files), the accounting department notices that the information in the invoices is incorrect, as reported by the customer complaints. The CEO orders an immediate investigation of the three accountants' computers, which had been purchased just last month from a bankrupt internet cafe popular for LAN parties.

As a growing company, they only employ a small team of six part-time IT support professionals. They undertake an initial check of the systems, and they find a suspicious process running on one of the computers which seems to be connected to a McAfee service with a large amount of data being sent outside the company firewall.

At this point, they do not feel that they have the expertise to carry out a full-scale malware/forensic investigation, and becoming nervous about a possible attack, the IT team scribbles down the name of the suspicious process ("greencat-2") and quickly shuts down all three accounting department computers and unplugs their power cords.

As there is increased competition in the start-up eCommerce domain, the company is anxious to ensure that their systems have not been compromised. The CEO quickly employs a cyber forensic investigator (YOU) to determine what malicious activity has taken place, where it originated from, and if the company is to blame (i.e., could be sued) for any damages.

Unfortunately, because the accounting department computers contain sensitive customer financial information, the CEO will only allow you access to the suspicious greencat-2 binary. By investigating only this binary, can you determine:

Instructions:

Download webc2-greencat-2.7z from Dropbox: https://www.dropbox.com/scl/fi/95t2w0nk2z749yed1f3vb/webc2-greencat-2.7z?rlkey=e2740othz3n5v47dpxxrdb15l&dl=0

Move this zip file to where you're running GHIDRA and unzip it using the following 7z command. The 7z file is password protected. The password is "infected" (no quotes).

7z e webc2-greencat-2.7z

After you unzip the malware, check that the unzipped malware binary file's MD5 hash is correct. GHIDRA will also show the MD5 hash after loading the binary, so double-check in GHIDRA as well.

57e79f7df13c0cb01910d0c688fcd296 webc2-greencat-2

Load the executable into GHIDRA. This malware is a standard 32-bit PE executable.

Starting from the WinMain(…) function located at 0x0040297D, follow the control flow in the disassembly. Imagine that you are a recursive-descent disassembler: start from the entry of each function and go until the return of that function.

Comments should be placed on every block of logically-related instructions. "Logically-related" is defined as 5 or fewer sequential instructions that together accomplish a single (easily commented) action. Control flow instructions (e.g., call, jmp) cannot be in the middle of a commented block; they can only be the first or last instruction of the block!

Since this binary is dynamically linked, you will have to Google any library functions that you do not know.
This will be important for understanding the return values and side effects (e.g., the memory copied by strcpy) of the library functions.

It will be helpful if you keep track of the values in memory! For example: some library calls will save data into a buffer in memory (e.g., GetUserNameExA) and then greencat-2 will use that data later on. GHIDRA can help with this: it shows empty space where static variables should be. You can double-click any static variable's name to jump to its location and add comments or dummy values. Unfortunately, GHIDRA cannot help with dynamically allocated variables. You do not have to turn in anything related to memory tracking, but it will be helpful to you.

Optional: Provide answers to the 3 investigation questions from above in a file called answers.txt and submit it with your lab submission in Gradescope. MAX of 5 sentences per answer. 5 bonus points per correct answer.

As always: Post any questions or ideas on Ed Discussion! Even code snippets are fine, as long as they do not give away a key answer to this assignment. Class collaboration is encouraged; it's us versus malware! If you're not sure if your post is safe, send it to the Prof/TA in a private post to verify.

Exporting Comments from GHIDRA

-5 points for not following these steps properly. Default export settings in GHIDRA will truncate comments!

Click on File and then click on Export program as shown in the figure below. This will open a new pop-up; in it, choose ASCII format as shown in the figure below. Click on options to increase the end-of-line width (max comment width).

Grade: 100 points

Grading Criteria: The grade will be based on how well you understood the functionality and capabilities of greencat-2. -10 points for each key malware functionality that your team fails to comment or entirely misunderstands (e.g., your comments say the code is reading a file, but it is really writing to a socket). All key malware functionality in greencat-2 is reachable via careful recursive descent (there are no anti-disassembly tricks). I will not deduct points for misunderstanding single instructions.

Bonus: 5 bonus points will be awarded for each correct answer to the 3 investigation questions above.

Teams: This assignment can be done individually or in a team of 2. Please join a group in Gradescope if you are collaborating. Do not create or join a group in Canvas. Canvas groups are different from Gradescope groups. New to Gradescope? This link provides instructions for how to create groups in Gradescope: https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members

Zoom can also provide the ability to collaborate and video conference with your teammate.

If you would like to set it up, your team can collaborate using a GHIDRA server. The instructions for that can be found at the following blog post. Make sure to configure your server in a way that is not publicly available. This is outside the scope of this class, so we cannot provide IT support for GHIDRA servers, but please post any questions or tips on Ed!
https://byte.how/posts/collaborative-reverse-engineering/

For people who are new to GHIDRA and would like to get some insight, here is a blog post for that as well. https://byte.how/posts/what-are-you-telling-me-ghidra/

Submission Instructions: Upload your commented ASCII Listing file exported from GHIDRA (called submission.txt) to the assignment in Gradescope. Note: Gradescope will only check the formatting of your submission.
Gradescope will not automatically check for correctness and provide a grade.

Upload your answers to the three questions in a file named answers.txt. Do not zip or compress the files. Note: Gradescope will only check the formatting of your submission. Gradescope will not automatically check the correctness and provide a grade.

Note: You can download the webc2-greencat-2.7z file directly into your lab environment. After you are done with this lab, you can submit your files directly from the lab environment (highly recommended). Doing this will help you avoid transferring the file from the lab workspace to your personal computer.

Transferring Files: To transfer files from your personal device to the lab environment: Create a zip folder of all the files that you would like to transfer to the lab environment. Every GT student has Box and OneDrive accounts given free by the institution. Log in to either of those two and upload the desired files. Now go back to the lab environment and log in to whichever of those two services you uploaded your zip folder to. Download the folder to the lab workspace and use the appropriate 7z command to unzip your folder.

Additional Resources: The following video shows how you can create header files and then load them into Ghidra to use for setting datatypes to whatever you defined. https://www.youtube.com/watch?v=u15-r5Erfnw&ab_channel=0x6d696368

Is ChatGPT (or other AI assistance) allowed for this lab? The short answer is yes, ChatGPT or other AI assistance software is allowed for this assignment, since after all real practitioners will use all of the tools at their disposal to fight threat actors. Please keep in mind the following, however:

Do not directly copy anything from the AI assistant chat window. It is crucial for you as the student to be able to articulate your understanding in your own words.

ChatGPT, among other AI assistants, is notoriously bad at interpreting assembly, especially assembly that a malware author is intentionally trying to obfuscate for a human analyst! Do not blindly assume that the AI assistant's interpretation is correct.

Often an AI assistant's interpretation of assembly does not capture the true intention of the code, and instead regurgitates the actions of the CPU for each instruction. We are expecting your comments to encapsulate the high-level reason for the code's existence, and it is not sufficient to explain which registers are storing which values without understanding the underlying reasoning.
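As a quick check, the MD5 verification step described in the instructions above can also be scripted. A minimal Python sketch; it assumes the unzipped binary (named webc2-greencat-2, as shown in the instructions) sits in the current directory:

import hashlib

EXPECTED = "57e79f7df13c0cb01910d0c688fcd296"   # MD5 given in the instructions

# Read the unzipped binary and compute its MD5 digest.
with open("webc2-greencat-2", "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()

print(digest)
print("OK" if digest == EXPECTED else "Hash mismatch: re-download the archive before analyzing")

GHIDRA's import summary shows the same MD5, so the two values can be cross-checked after loading the binary.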
In this assignment, you will get familiar with Git and some of its main features. The assignment assumes that you are working from the command line, and we will simulate the presence of multiple users using two terminal windows. If you are already familiar with Git and a specific Git client, feel free to use that, as long as you do what the assignment requires.In the following, we use the term “REPO” to indicate the personal repository we assigned to you, which is named: https://github.gatech.edu/gt-omscs-se-2025summer/6300Summer25.gitNotes: ● You may receive an email at your official Georgia Tech email address when we create your repository. ● If you have not received the above email within a day from the release of this assignment despite having completed Assignment 1, make sure to check your spam folder and that you are actually checking your GT email address. However, GT GitHub does not always send these emails. If you do not get one, you should still be able to log into https://github.gatech.edu/gt-omscs-se-2025summer or use the ‘GT GitHub’ link from Canvas and see the repo assigned to you. ● We will not be able to give you access to REPO until you activate your GT GitHub account, (i.e., you must have logged at least once into https://github.gatech.edu/).Please contact us on Ed Discussion in case of problems and, once more, make sure to use the repository we assigned to you and not one you created. All individual assignments for this class should be done in this assigned private repository only.Please also make sure to read the whole assignment before getting started and to follow the instructions we provide to the letter (e.g., use the exact (to the letter) commit messages provided in the assignment, rather than variations of them). If you don’t know how to accomplish a task, either consult Git’s help by running “git –help ” or leverage online resources (there are plenty, such as this Git cheat sheet or StackOverflow’s Git for beginners ). If you receive an error while executing a Git command, make sure to read the error message–Git often suggests exactly the right thing to do. Every time you ask the teaching staff to do something that has been clearly stated in this assignment that we won’t do for you, your points may be deducted.If you are not familiar with Git and feel completely lost, we suggest that you have a second look at the Git demo in P1L4 and mimic the steps in the demo on your machine, taking into account that the demo uses a standard GitHub and not a GT GitHub repository. In addition, you can also take the tutorial at http://try.github.io/levels/1/challenges/1, which should further get you familiar with Git’s basic concepts.We also suggest that you use GitHub’s “Network” view to monitor the state of your local and remote repository throughout the assignment. To do so, go to the GT GitHub page for your REPO, select menu “Insights” at the top, and then select the “Network” tab at the left. In this way, you will have a visual representation of how the repository evolves, which can be very useful for better understanding Git and how it operates. Other pages in GT GitHub show the list of commits and the tags (under ‘releases’). Furthermore, we provide a screenshot of how your repository should look at the end of each part of the assignment, which you can use as a reference. 
You can also run “git log –graph” from the command line, which outputs a text-based representation of the same information, but we do not provide a reference output for this representation.Finally, you may practice this assignment on a separate repository and then perform the assignment on REPO when you feel comfortable with the various commands. (This would also allow you to get started before you receive your official repo from us.) If you were to make a mistake while working on REPO and you wanted to restart from a clean slate, in most cases you should be able to do so by executing the repo reset instructions. Assignment Instructions: Preliminary note: We strongly recommend that you (1) clone your GitHub repo using HTTPS and (2) use a credential helper to tell Git to remember your GitHub username and password, so that you don’t have to re-login every time. You can find instructions on how to do so here: https://help.github.com/articles/caching-your-github-password-in-git/#platform-all In all steps, the quotation marks are only to designate specific text labels and commands, and are not part of the required text. Do not add any additional text to the files or commit messages other than what is specified. Copy and paste is suggested as you must match this document exactly. If a step cannot be completed as specified, such as git not allowing a merge commit when one is specified, you may have made an error earlier in the instructions. Part 1 (Terminal 1) Before you start, make sure to specify your name and email address using the command “git config”, if you haven’t already. 1. Open a terminal window 2. Create and go to directory User1 3. Clone REPO 4. This should create a directory called 6300Summer25 under directory User1. 5. Go to directory 6300Summer25 (here you can also open the Network view in GT GitHub and start monitoring how your repository evolves) 6. Make sure that the directory contains a file called MyPrivateRepo 7. Create and go to directory Assignment2 (under 6300Summer25) 8. Create a file called myinfo.txt that contains only one line with your first and last name. Ensure that the file extension is correct for this and later files. 9. Commit the file to your local repo with comment “Added myinfo file” 10. Create a branch called “development” and switch to it 11. Create a file called dev1.txt that contains the text “Dev 1 file”. 12. Commit the file to your local repo (it should be in branch “development”) with comment “Added dev1 file” 13. Switch to the “main” branch 14. Edit file myinfo.txt and add your GT Email (e.g., [email protected]) on the next line. 15. Commit the file to your local repo with comment “Edited myinfo file” 16. Merge the “development” branch into the “main” branch with commit message “Merge #1” 17. Push all branches to the remote repositoryAt this point, your remote repository should look like the one in the figure below in the “Network” view on GT GitHub. If it doesn’t, it means that you made a mistake in one of the steps.Part 2 (Terminal 2) 1. Open a second terminal window 2. Create and go to directory User2 3. Clone REPO 4. Just like before, this should create a directory called 6300Summer25 under directory User2 5. Go to directory 6300Summer25/Assignment2 6. Switch to the “development” branch 7. Create a file called dev2.txt that contains the text “Dev 2 file”. 8. Commit the file to your local repo (it should be in branch “development”) with comment “Added dev2 file” 9. Create a branch called “temp” and switch to it 10. 
Create a file called mytemp.txt that contains the text “Mytemp file”. 11. Commit the file to your local repo (it should be in branch “temp”) with comment “Added mytemp file” 12. Create and commit to branch “development”, with the comment “Added dev3 file”, a file called dev3.txt that contains the text “Dev 3 file”. 13. Merge the “temp” branch into the “development” branch with commit message “Merge #2” 14. Merge the “development” branch into the “main” branch with commit message “Merge #3” 15. In the main branch, edit file myinfo.txt and add the semester and year (i.e., “Summer 2025”) in a separate line. 16. Commit the file to your local repo with comment “Edited myinfo file again” 17. Push all branches to the remote repository.At this point, your remote repository should look like the one in the figure below in the “Network” view on GT GitHub. If it doesn’t, it means that you made a mistake in one of the steps.Part 3 (Terminal 1) 1. Go back to the first terminal 2. Make sure to be in branch “main” 3. Edit file myinfo.txt and add this text on the next single line of the file, using your own username: “6300Summer25” 4. Commit the file to your local repo with comment “Edited myinfo file for the third time” 5. Push the main branch (not the other branches) to the remote repository. To do so, you will have to pull changes from the remote repository and suitably handle conflicts. In handling conflicts, make sure not to lose any content, not to have any of the extra text added by Git to mark the conflicting parts, and to preserve the order of the information as it appears in the assignment on four separate lines: first and last name, GT email, semester, and repository name (i.e., 6300Summer25). Use commit message “Final merge with conflicts fixed” for the additional commit after merging and handling the conflicts. 6. Tag the current version of the main as “V1” and push the tag to the remote repository. Use a LIGHTWEIGHT tag. At this point you are done, and your remote repository should look like the one in the figure below in the “Network” view on GT GitHub. If it doesn’t, it means that you made a mistake in one of the steps. Given the fact that this assignment is fairly simple, as it mainly consists of executing the provided steps, we expect the network view to match the figure below exactly, with branch labels as shown and no missing or extra commits. We will deduct points if this is not the case.Similarly, we expect all the complete commit messages, branches, the folder, the lightweight tag, the tag name (V1), file names, file extensions (.txt), and the content of the files in the repo to be correct to the letter. Do not include extra labels, any default text, or additional content in your files, tags or messages. Review the network view, your commits list, the files, and the tag page.As stated above, if your network view differs from the one provided, it means that you made a mistake in following the provided instructions, so you should try solving the assignment again before posting screenshots of your network and asking for suggestions. 
Please keep in mind that this is a fairly simple assignment in which all you have to do is perform the provided steps, so you should not really need any external help.

Here are some additional views from GitHub that you can use to further check your submission:

Submission
To submit your solution, you should:
● Retrieve the commit ID for your submission by running "git log -1" in Terminal 1 after the last (successful) command; the commit ID is the long hexadecimal number after "commit".
● Submit on Gradescope a file called submission.txt that contains, on two separate lines, (1) your GT username and (2) the commit ID for your submission. For example, the content of file submission.txt for George P. Burdell could look something like the following:
submission.txt
gpburdell1
81b2f59
● As soon as you submit, your assignment will be graded by checking that your graph is correct and that you created the right files with the right content. After that, you will see a report with your grade and some corresponding feedback. The feedback should be self-explanatory, but please let us know if something is not clear.
● The grade you will see on Gradescope is your actual grade for the assignment.
● You can resubmit as many times as you want before the deadline, so you have a chance to address any issue Gradescope finds with your submission.
● Note: the autograder for this assignment is a fairly sophisticated piece of code that we developed recently. Please let us know if you observe any unexpected behavior.

Repo Reset Instructions:
● IMPORTANT:
○ These are instructions that you should follow only if you mess up your REPO (e.g., you have an incorrect commit message, or committed the wrong thing at the wrong time). Please don't ask us to reset your repo, as we would simply run the same instructions described below, which is something you can do without our help.
○ This is one of the few cases in which we recommend the use of the "git push --force" command, which we otherwise strongly discourage, as it can have disastrous effects. Do not use these commands on your repository after this assignment, or any similar commands that may remove history, remove branches, remove tags, or change prior commit IDs, including "--force" or "-f" commands. There is no need to clean up your repository at any point after this assignment is completed.
○ Some of these commands may fail if you haven't yet done what they are trying to undo. These errors can be safely ignored.
● Run "git log" from within your local repository
● Get the last commit ID in the list (i.e., the one with the earliest date, which should have "Initial commit." as its associated comment.)
● Run the following in each of the two terminals (one for the User1 directory and the other one for the User2 directory), from the root of your repo (directory 6300Summer25):
○ git checkout main
○ git branch -D development temp
○ git push origin :development
○ git push origin :temp
○ git tag -d V1
○ git push origin :V1
○ git reset --hard <initial commit ID from above>
○ git push --force
● After resetting the state, there is no need to re-clone the repository (i.e., there is no need to perform the first 5 steps of the instructions).
User Stories

User Interface
The Arriba Eats company has plans to eventually develop a graphical user interface for this application; however, the prototype implementation will have a simple text-based interface. To make it easier to switch to a graphical user interface in the future, we want to ensure the user interface code is kept separate from the application logic. The detailed specification for this application's user interface is available here: CAB201 OO Assignment – Interface Specification

Database
The Arriba Eats company plans to eventually use a relational database to store and save all customer data. However, for this prototype, all data will be kept in memory (and will therefore be lost when the application is shut down).

Recommended Approach
You are encouraged to design and develop the application using an iterative and incremental approach, implementing one user story at a time. You can implement the user stories in any order you choose; however, for each new user story, you should design and implement only the classes and methods required to implement that user story. The following points need to be emphasised. When producing your solution, you may need to first refactor some of the classes and methods that you have already implemented for earlier user stories to better facilitate the new user story. After each user story, you should have working code that achieves all of the user stories considered so far. If your submitted code does not compile, then you will receive 0 marks for functionality.

Restrictions
The following restrictions will apply to your input. In addition to the above, the following restrictions will apply:

Test Cases
Assignment 2 – Inputs and Outputs.zip
We have provided a set of input and output text files that you can use to develop your assignment. These files are exact duplicates of the Gradescope test cases. The input files capture your typed input, such as menu responses and names. The output files record the results that would typically be displayed on the screen. This will enable you to test each case individually, rather than going through Gradescope every time. However, once you have implemented each test successfully, we recommend resubmitting to Gradescope to make sure that your submission does not introduce errors to cases that you have previously passed.

There are two ways you can use these files.

executable < directory/inputfile

or if you are using PowerShell use:

Get-Content directory/inputfile | executable

Where:

On Windows using Visual Studio 2022, follow these steps:

C:\>cd "C:\CAB201\ArribaEats\bin\Debug\net8.0"
C:\CAB201\ArribaEats\bin\Debug\net8.0>
C:\cab201\ArribaEats\bin\Debug\net8.0>ArribaEats < "Inputs\1. Startup Menu.txt"
Welcome to Arriba Eats!
Please make a choice from the menu below:
1: Login as a registered user
2: Register as a new user
3: Exit
Please enter a choice between 1 and 3:
Thank you for using Arriba Eats!
C:\cab201\ArribaEats\bin\Debug\net8.0>

Your screen will display the output. You can compare this output with the corresponding Gradescope output either by manually inspecting the corresponding file in the RefOutputs folder, or by redirecting the output of your program into a file and then using the fc command:

C:\cab201\ArribaEats\bin\Debug\net8.0>ArribaEats < "Inputs\1. Startup Menu.txt" > MyOutput.txt
C:\cab201\ArribaEats\bin\Debug\net8.0>fc "RefOutputs\1. Startup Menu.txt" MyOutput.txt
Comparing files REFOUTPUTS\1. Startup Menu.txt and MYOUTPUT.TXT
FC: no differences encountered

Remember, anytime you enter invalid input, the program will loop until valid data is provided. This could potentially lead to an infinite loop if your implementation handles errors incorrectly.

Test Cases

Code Quality
While it is possible to write poor quality code that passes all of the autograder tests, it is also important to write high-quality code. Below is a non-exhaustive list of how each test can help improve your code quality. Please note that this list will help you write better code, but you will still need to apply other code quality techniques discussed during the semester to achieve a high mark (e.g., object-oriented class design, visibility, SOLID design).

Code Smells
Code smells are issues that indicate poor code quality. While the code may be functional and pass Gradescope tests, you may be marked down when your code quality is assessed. Note that the list is diverse: some smells can be fixed after your first attempt at writing code, but others should be avoided from the start. Some are simple issues, while others are more serious. Refactoring is a good way of addressing them (https://refactoring.guru/refactoring, https://refactoring.com/).

Gradescope Smells

Marking Criteria
There are 50 marks available for this assessment item. Your submission will be marked with a mix of automated and manual processes.

Autograder: 20 marks
When you submit your code to Gradescope, it will be automatically run against a battery of test cases, checking many different combinations of customers, restaurants, restaurant items, delivery drivers and orders, testing different paths through your program. These have been designed to vary in complexity, ensuring that you can pass some test cases even with a very early implementation. For this reason, it is recommended that you submit to Gradescope early and often, to ensure that your code is successfully compiling and running on our grading platform.

Each test case has the same structure, where you will be presented with a menu and you have to enter input to navigate it. There are also a handful of test cases that purposefully provide invalid input. Your program will receive full marks for that test case if it behaves correctly in that situation, and no marks otherwise.

It is important that your program produces the same expected output as Gradescope. Variations in the output formatting (i.e., the messages your program prints out) may cause Gradescope to misinterpret your input and cause you to fail test cases that you would have otherwise passed. Make sure you use Console.ReadLine() for reading all text; other Console functions for reading keys may not work with Gradescope, even though they might seem to be fine when run on your machine. Every input is on a line by itself and terminated by pressing the enter key, and that goes for selecting options from the menu as well as text and numeric inputs.

One last note: While we do not vary the test cases from run to run, if we discover during manual inspection of your program that you have hard-coded in our test cases and results (that is, instead of implementing the restaurant ordering and delivery logic, you have built your program to specifically recognise our test cases and produce the output we expect) you will receive a score of 0 for functionality, and likely a poor score for the object-oriented design and implementation, since it will not be considered useful work.
Abstraction: 7 marks
Your use of abstraction is marked by analysing your source code. Under this criterion, the following will be penalised: In general, we are looking for anything other than code clarity and your OOP design (as those are marked in their own criteria) that makes your code more difficult to maintain.

Object-oriented design and implementation: 10 marks
Your object-oriented design and implementation will be checked by looking at your class inheritance hierarchy and examining source code to check your use of polymorphism and encapsulation, as well as the private/protected/internal/public state of your methods, fields and properties.

Coupling and cohesion: 8 marks
Your coupling and cohesion will be checked by examining class interdependencies and looking at the methods/fields/properties present in each class. As a refresher, coupling is something you are trying to minimise (it means unhealthy dependencies between classes that make your code less reusable), while cohesion is something you are trying to maximise (it means each class is responsible for a single unified task and all methods/fields/properties support that task).

Code clarity and comments: 5 marks
Code clarity is marked by looking at how your choice of identifiers (names of classes, methods, properties, fields, local variables) and your program's flow (use of loops / branching) affects the comprehensibility and maintainability of your code. For comments, we are looking for two different kinds: top-level XML tag comments (///) and body-level comments (usually single-line // comments).
Vikranth Udandarao
2022570

1. To compare the manual convolution with the result from np.convolve(), both were computed and plotted on the same graph. The manual convolution was performed using nested loops to apply the convolution sum directly, while np.convolve() used NumPy's built-in convolution function, which is optimized for performance. In the plot, we differentiate the two methods by using distinct colors and line widths: the manual convolution is shown with a thicker blue line, and the np.convolve() result is depicted with a thinner red line. Both methods yield nearly identical results, which visually overlap on the plot, indicating that the manual approach correctly implements the convolution process.

2.

3. Prominent Frequency Components:
   ○ The FFT magnitude spectrum clearly shows two primary peaks:
     ■ One at 5 Hz.
     ■ The second at 12 Hz.
   ○ These frequencies correspond exactly to the frequencies f1 = 5 Hz and f2 = 12 Hz that were used to construct the signal in the time domain.
   Amplitudes of the Peaks:
   ○ The peak at 5 Hz has a magnitude of approximately 32. This is consistent with the fact that the cosine term has an amplitude of 1. In the FFT, the magnitude of this peak is scaled by the number of samples N, which results in a larger peak.
   ○ The peak at 12 Hz has a magnitude of approximately 16. This corresponds to the second cosine term where the amplitude is 0.5. The FFT shows a peak half as large as the one at 5 Hz, which matches the expected amplitude scaling.
   Symmetry of the Spectrum:
   ○ The full FFT result (first plot) is symmetric around 0 Hz, with peaks appearing at both positive and negative frequencies. This is a direct consequence of the fact that the signal x[n] is real-valued, leading to conjugate symmetry in the frequency domain.
   ○ The second plot focuses on the positive frequencies, showing only the significant frequency components at 5 Hz and 12 Hz.

4. … The Gaussian pulse x(t) is a smoothly varying signal in the time domain. Its magnitude spectrum, obtained through the Fast Fourier Transform (FFT), reveals several important characteristics about the frequency content of the pulse.
   a. Gaussian Shape in the Frequency Domain: The FFT of a Gaussian pulse results in another Gaussian-shaped magnitude spectrum in the frequency domain. This property arises because the Fourier Transform of a Gaussian function is also a Gaussian. The spectrum is centered at 0 Hz, reflecting that the pulse is a low-frequency signal with most of its energy concentrated around zero frequency.
   b. Dominance of Low-Frequency Components: The bell-shaped spectrum shows that the energy of the Gaussian pulse is primarily concentrated at low frequencies. The amplitude decreases rapidly as we move away from 0 Hz. This indicates that the Gaussian pulse is composed mainly of low-frequency components, which corresponds to its smooth and continuous nature in the time domain.
   c. Width of the Frequency Spectrum: The width of the magnitude spectrum depends on the parameter σ, which controls the width of the Gaussian pulse in the time domain. In this case, with σ = 0.1, the spectrum is relatively wide in the frequency domain. According to the time-frequency uncertainty principle, a narrower pulse in the time domain (small σ) results in a broader frequency spectrum.
   d. If σ were larger, the time-domain pulse would be wider, and the corresponding frequency spectrum would be narrower, concentrating more energy near 0 Hz.
   e. Interpretation of Frequency Components:
      The central peak around 0 Hz indicates the strong presence of DC (low-frequency) components, while the rapid decay of the spectrum indicates minimal contributions from higher frequencies. This reflects that the Gaussian pulse is smooth and doesn't exhibit rapid changes, which would require higher frequency components.

5. FIR Filter (Finite Impulse Response)
   1. Time Domain Behavior:
      ○ The FIR filter attenuates the high-frequency component at 100 Hz while retaining the 10 Hz component, as seen in the FIR-filtered signal plot.
   2. Frequency Domain Behavior:
      ○ In the frequency domain, the 100 Hz component is clearly attenuated in the FFT of the FIR-filtered signal, while the 10 Hz component is preserved. However, the attenuation at 100 Hz is not complete, and some leakage is visible.
      ○ The frequency response of the FIR filter shows a gradual roll-off after the 50 Hz cutoff frequency, and as a result, the filter does not entirely remove components near the cutoff, which is a common issue in finite-length FIR filters. This makes the filtering less sharp than with an IIR filter.

   IIR Filter (Infinite Impulse Response)
   1. Time Domain Behavior:
      ○ The IIR filter effectively removes the high-frequency 100 Hz component and retains the 10 Hz component with minimal distortion. Compared to the FIR filter, the IIR filter introduces significantly less delay and closely matches the low-frequency content of the original signal.
      ○ The IIR filter achieves this with fewer coefficients, making it computationally more efficient than the FIR filter. However, recursive IIR filters can introduce phase shifts, although using zero-phase filtering (filtfilt) here minimizes this issue.
   2. Frequency Domain Behavior:
      ○ The FFT of the IIR-filtered signal shows complete suppression of the 100 Hz component, while the 10 Hz component is preserved cleanly, indicating the IIR filter's sharper frequency cutoff.
      ○ The frequency response of the IIR filter shows a much steeper roll-off after 50 Hz, which results in better attenuation of higher frequencies compared to the FIR filter. The IIR filter performs better in suppressing unwanted frequencies while retaining the desired low-frequency components.

   Comparison of FIR and IIR Filters:
   1. Attenuation Efficiency:
      ○ Both the FIR and IIR filters effectively suppress the high-frequency component at 100 Hz, but the IIR filter achieves sharper attenuation, as evidenced by the frequency response and FFT plots.
      ○ The FIR filter's gradual roll-off leads to some residual presence of the 100 Hz component in the frequency spectrum, whereas the IIR filter eliminates it more effectively.
   2. Delay and Phase Distortion:
      ○ The IIR filter, being recursive, introduces minimal delay while still suppressing the high-frequency content, making it more suitable for real-time applications that require lower latency. Phase distortion can be a concern in IIR filters, but the use of zero-phase filtering (filtfilt) avoids this.
   3. Computational Efficiency:
      ○ The FIR filter requires more coefficients and thus more computational resources to achieve similar performance compared to the IIR filter.
      ○ The IIR filter is more computationally efficient, as it uses fewer coefficients to achieve a sharper cutoff and better attenuation of high frequencies.
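For reference, a minimal Python sketch of the FIR/IIR comparison described above. The sampling rate (1000 Hz), filter orders and signal amplitudes are assumptions chosen for illustration, not values taken from the assignment.

import numpy as np
from scipy.signal import firwin, lfilter, butter, filtfilt

fs = 1000                                          # assumed sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)
x = np.cos(2 * np.pi * 10 * t) + 0.5 * np.cos(2 * np.pi * 100 * t)   # 10 Hz + 100 Hz

# FIR low-pass: 101 taps, 50 Hz cutoff (windowed-sinc design)
b_fir = firwin(numtaps=101, cutoff=50, fs=fs)
y_fir = lfilter(b_fir, 1.0, x)

# IIR low-pass: 4th-order Butterworth, applied with filtfilt for zero phase
b_iir, a_iir = butter(4, 50, btype="low", fs=fs)
y_iir = filtfilt(b_iir, a_iir, x)

# compare how much of the 100 Hz component survives each filter
freqs = np.fft.rfftfreq(len(t), 1 / fs)
for name, y in [("FIR", y_fir), ("IIR", y_iir)]:
    mag = np.abs(np.fft.rfft(y))
    print(name, "residual at 100 Hz:", mag[np.argmin(np.abs(freqs - 100))])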
Write your name and section number at the top right of this page:
Class            Section Number
Bret 8:50        001
Sahifa 1:20      003
Miranda 9:55     004
Sahifa 3:30      005
Sahifa 8:50      006
Cameron 1:20     007

As you complete the exam, write your initials at the top right of each other page. If you need more room, there is a blank page at the end of the exam, or we can give you some scratch paper. Some multiple choice questions are "Select ONE" while others are "Select ALL that apply". Pay attention to the question type and only mark one option if it says "Select ONE". Fill in the circles completely.

1. A dataframe lions has 32 rows, each representing a unique lion. It has two numeric columns: age, each lion's age in decimal years, and proportion.black, the percentage of that lion's nose which is black. Finally, it has a logical column adult, which is TRUE if age is greater than 3 and FALSE otherwise. Consider the following code which attempts to make a scatter plot of age on the X axis vs. proportion.black on the y axis. Which of the following statements are true about errors in the above code? Select ALL that apply.
fill = adult must be changed to color = adult to color the points differently.
alpha = 0.5 must be moved outside of aes(), like size = 2 currently is.
size = 2 must be moved within aes(), like alpha = 0.5 is.
fill = adult must be moved to the aes() in the ggplot() call, not geom_point().

2. Using the same dataframe as Question 1, consider the following code which creates a new dataframe, lions summary. Which of the following statements are true about lions summary? Select ALL that apply.
lions summary has 2 rows.
lions summary has 2 columns. (False: adult, averageAge, and averagePropBlack.)
lions summary has a column called 'adult'.
If there were any NA values in the age column of lions, all the values of averageAge in lions summary will be NA. (Sneakily false: if there are only NA values in one category of adult, the other can still function!)

3. A random viewer's ideal movie length, in minutes, is approximately X ∼ N(120, 15). Actual movie lengths are given by Y ∼ N(143, 19).
(a) Which R code below calculates the probability that a random movie is shorter than the 40th percentile for ideal movie length? Select ONE.
qnorm(0.4, 120, 15) %>% pnorm(143, 19)
qnorm(0.4, 143, 19) %>% pnorm(120, 15)
pnorm(0.4, 120, 15) %>% qnorm(143, 19)
pnorm(0.4, 143, 19) %>% qnorm(120, 15)
(b) What movie length y corresponds to a z-score of 1? Select ONE.
19   143   124   162 (143 + 19 = 162)

4. Your friend claims that a random variable, X, follows the following probability distribution. Which of the following are true about your friend's claim? Select ONE.
This distribution has negative values of x, therefore X is not a random variable.
This distribution has negative values of P(X = x), therefore X is not a random variable.
The area under this distribution is not 1, therefore X is not a random variable.
X is a valid random variable.

5. Consider ordering a "footlong" (12 inch) sub from Subway. Consider trying to estimate µ, the average length of the sub you receive when you order a 12 inch sub. You do this by ordering 40 footlong subs and measuring their lengths in inches. (What a great problem this is.) You observe that the average length of your 40 subs is 11.5 inches, and the standard deviation of their lengths is 0.4 inches. Consider testing H0 : µ = 12 vs. Hα : µ < 12. Write a numeric expression (only containing numbers and arithmetic symbols like + and ×) for the test statistic of this test: (x̄ − µnull) / (sx / √n).
Answer:

6.
The fictitious distribution of the number of children in a household (X) is given below.
x:          0     1     2     3
P(X = x):   ?     0.4   0.25  0.15
(a) Which of the following statements about X is true? Select ONE. X is a discrete RV. X is a binomial RV. X is a continuous RV. X is a normal RV.
(b) What value of P(X = 0) makes this a valid probability distribution? Select ONE. 0.1   0.15   0.2   0.25
(c) What is the median number of children per household? Select ONE. 0   1   2   Not enough information to determine

7. Data is collected on the duration each winter that a lake's surface is frozen over a period of 103 consecutive years. The correlation coefficient between the variables is r = −0.4. The year variable has a mean x̄ = 1950 and a standard deviation sx = 30. The freeze duration variable has ȳ = 90 and a standard deviation sy = 17. Consider fitting a simple linear regression model to this data. Write a numerical expression (using only numbers and arithmetic symbols like + or ×) for the predicted freeze duration in the year 1890. No need to simplify.
Answer: (1890 is two x standard deviations BELOW the mean of x, so our predicted value will be r times two y standard deviations below the mean of y.) 90 + (−0.4) × (−2) × 17. You could also take the long way: β̂1 = r × sy/sx = (−0.4) × (17/30), β̂0 = 90 − β̂1 × 1950, so ŷ1 = β̂0 + β̂1 × 1890.

8. We continue to consider the lake freezing data from Problem 7. A 95% confidence interval for the freeze duration of the true regression line at x = 1890 has the form ŷ1 ± c × SEC. A 95% prediction interval for the duration that the lake was frozen in the single year 1890 has the form ŷ1 ± c × SEP, where c is a critical value from some distribution and SEC, SEP are some positive values. Reminder: there are 103 years in the dataset. Which R code calculates the numeric critical value c? Select ONE.
qnorm(0.95)   qnorm(0.975)   qt(0.95, 101)   qt(0.975, 101)   qt(0.95, 102)   qt(0.975, 102)

9. Using the expressions from the previous problem, circle the true relationship between SEC and SEP and briefly explain why. SEC < SEP   SEC = SEP   SEC > SEP
Answer: A confidence interval for the height of the regression line is narrower than a prediction interval at the same x value, but they have the same center and critical value. Therefore, it must be that SEC < SEP.

10. Consider trying to estimate the proportion of UW-Madison undergraduate students who will graduate at the end of this semester, p. You ask 50 random students, and 7 of them will graduate at the end of the semester. Consider using this information to calculate a confidence interval for p. Which of the following statements are true? Select ALL that apply.
If we were to calculate an Agresti-Coull confidence interval for p, its center would be (7 + 1)/(50 + 2). (False: it would be (7 + 2)/(50 + 4).)
If we were to calculate a Wald confidence interval for p, its center would be 7/50.
The upper bound of the Agresti-Coull confidence interval would be greater than the upper bound of the Wald confidence interval. (It's shifted towards 0.5.)
As we increase the confidence level of our interval towards 100%, our interval would widen and approach [0, 1].

11. We are interested in comparing the proportions of individuals with bachelor degrees among adult men aged 25 and older in the states of Minnesota and Wisconsin. Let pM and pW represent these two population proportions. In random samples of nM = 300 Minnesota men and nW = 400 Wisconsin men aged 25 and older, the numbers of individuals with bachelor degrees are xM = 90 and xW = 100.
Without simplification, write a numerical expression using the provided data for the standard error SEci used in a 95% confidence interval for pM − pW of the form (point estimate) ± zcrit × SEci, when using the Agresti-Coull method.
Answer: SEci =

12. In the setting of the problem above, write a numerical expression using the provided data for the standard error SEht of the test statistic Z = (p̂M − p̂W) / SEht ∼ N(0, 1).
Answer: SEht =

13. Suppose that the test statistic from the previous problem has the value z = 1.47 and that the p-value from a two-sided test is 0.14. Which of the following statements are true? Select ALL that apply.
The test is statistically significant at an α = 0.05 level.
A 95% confidence interval for pM − pW will contain the value 0. (Insignificant p-value.)

14. You ask a group of 30 people from Country A to each privately give you a random number between 1 and 1000. You then ask a group of 40 people from Country B to each privately give you a random number between 1 and 1000. Let µA be the true average value that all members of country A would give upon this request, and similarly for µB. You are interested in the quantity µA − µB. Which of the following statements are true about how to approach this problem through statistical inference? Select ONE.
Two-sample means inference is impossible to conduct because the sample sizes are different.
Two-sample means inference is impossible to conduct because the random variables are technically discrete.
Two-sample inference is more appropriate than paired-sample inference in this case.
Paired-sample inference is more appropriate than two-sample inference in this case.

15. Assume the correct value of the test statistic from problem 14 is -50. Which R code correctly calculates the p-value for this test? Select ONE.
pt(-50, df = W)
2*pt(-50, df = W)
2*pt(abs(-50), df = W)
1 - pt(-50, df = W)
This assignment has in total 55 base points and 10 extra points, and the cap is 55. Bonus questions are indicated using the ⋆ mark. Please specify the following information before submission:
• Your Name: Yufeng Xu
• Your NetID: yx3038

Problem 1: Infinite knapsack [10⋆ + 15 pts]
(a)⋆ Prove that in any optimal solution (n1,…,nm), the total number of copies of the items I2,…,Im included in the knapsack is at most wmax − 1, i.e., n2 + n3 + ··· + nm ≤ wmax − 1.
(b) Based on the conclusion of (a), design an algorithm Knapsack(m,W,(w1,v1),…,(wm,vm)) for infinite knapsack with running time O(m · wmax²). For convenience, your algorithm only needs to return the maximum value achieved. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. (Hint: According to (a), we know that, in an optimal solution, the total weight of the items I2,…,Im is less than wmax², and hence the item I1 occupies "most" of the space of the knapsack. Try to combine this observation with the O(mW)-time DP algorithm.)

Solution.
(a) We prove the claim by contradiction. Suppose there exists an optimal solution with n2 + ··· + nm ≥ wmax. Denote the weights of these copies as w′1,…,w′n, where n ≥ wmax and, for each w′j, there is some i ∈ {2,…,m} such that w′j = wi (several w′j's may equal the same wi). We want to show that there exist 1 ≤ l ≤ r ≤ n such that w′l + ··· + w′r is a multiple of w1. Consider the sum of the first k weights, Sk = w′1 + ··· + w′k, and write Sk ≡ Rk (mod w1) with Rk ∈ {0,1,…,w1 − 1}.
(i) If there is some k with Rk = 0, take l = 1 and r = k; then Sk is a multiple of w1.
(ii) If no Rk equals 0, then Rk can take only w1 − 1 possible values. Because n ≥ wmax ≥ w1 > w1 − 1, by the pigeonhole principle there exist k1 < k2 with Rk1 = Rk2, hence Sk2 − Sk1 ≡ 0 (mod w1). Take l = k1 + 1 and r = k2, so that w′l + ··· + w′r is a multiple of w1, say t · w1.
Therefore, we can replace w′l,…,w′r with t copies of I1, and the final value becomes larger, which contradicts the assumption that the original solution is optimal. Hence n2 + ··· + nm ≤ wmax − 1.

(b) We know from (a) that n2 + ··· + nm ≤ wmax − 1, hence the total weight of the items I2,…,Im is at most wmax(wmax − 1) < wmax². Consider the naive infinite knapsack algorithm: let DP(i,w) be the maximum value achievable by choosing from items {i, i + 1, …, m} with total weight at most w. Because we can either take a copy of the i-th item or not, DP(i,w) = max{DP(i + 1, w), vi + DP(i, w − wi)}. Since i = 1,2,…,m and w = 0,1,…,W, this algorithm has time complexity O(mW). Now, with the conclusion from (a), we can limit the search space of w for items i = 2,…,m to {0,1,…,wmax²}. The global maximum then equals max over w of {DP(2, w) + ⌊(W − w)/w1⌋ · v1}. For items 2,…,m and any weight limit w, the infinite-knapsack DP gives the optimal value of the chosen items. Because (a) tells us the total weight of items 2,…,m is less than wmax², we can limit their total weight to wmax² and simply take as many copies of item 1 as possible for the remaining capacity. This yields the correct optimal solution. Because the weight limit for items 2,…,m shrinks from W to wmax², and it takes O(1) time to add the copies of item 1 to a partial solution, the time complexity of the revised algorithm is O(m · wmax²). Below is the pseudocode:

def KNAPSACK(m, W, I):
    # I is a list of (w, v) records; item 1 (the item that fills most of the
    # knapsack in part (a)) is taken to be the item with the best value-to-weight ratio
    I = sorted(I, key=lambda x: x.v / x.w, reverse=True)
    w_max = max(item.w for item in I)
    B = w_max * w_max                            # weight budget reserved for items 2..m
    DP = [[0] * (B + 1) for _ in range(m + 1)]   # row m handles the boundary case
    for idx in range(m - 1, 0, -1):
        for w in range(B + 1):
            DP[idx][w] = DP[idx + 1][w]
            if w >= I[idx].w:
                DP[idx][w] = max(DP[idx][w], I[idx].
v + DP[idx][w - I[idx].w])
    # fill the remaining capacity with copies of item 1
    ans = max(DP[1][w] + ((W - w) // I[0].w) * I[0].v
              for w in range(min(W, B) + 1))
    return ans

Problem 2: Counting shortest paths [15 pts]
Given an undirected graph G = (V,E) and a vertex s ∈ V, we want to know, for every vertex t ∈ V, how many (distinct) shortest paths from s to t there are. This can be viewed as a generalization of the unique shortest-path problem. Design an algorithm Count(G,s) to solve this problem. The algorithm should return a list L that contains a pair (v, δv) for each vertex v of G, where δv is the number of shortest paths from s to v. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n + m) time, where n = |V| and m = |E|.
A sample example. Suppose the input graph is G = (V,E) where V = {s,a,b,c,d,e} and E = {(a,b),(b,c),(c,d),(a,d),(s,a),(c,e)}; see the figure below. Then the algorithm Count(G,s) should return L = [(s,1),(a,1),(b,1),(c,2),(d,1),(e,2)].
(Hint: Exploit the BFS tree and extend the idea in the unique shortest-path problem.)

Solution. We solve the problem by BFS. Consider a BFS tree of the graph with s as the root, and perform a level-order traversal. For each node, the number of times it appears on its level is taken as its number of shortest paths. For each neighbor of a node being processed, if that neighbor has not been counted yet, we append it to the queue that maintains the BFS traversal; otherwise we ignore it, because it must already have been counted at a shallower level, and its appearance in the current level does not correspond to a shortest path. The pseudocode is as follows:

class Node:
    def __init__(self):
        self.neighbors = []
    def add_neighbor(self, other):
        self.neighbors.append(other)
        other.neighbors.append(self)

def COUNT(G, s):
    (V, E) = G                     # V: list of Node; E: list of 2-tuples of Nodes
    for edge in E:
        edge[0].add_neighbor(edge[1])
    cnt = {}
    queue = [s]
    while queue:
        L = len(queue)             # number of entries on the current BFS level
        for u in queue[:L]:
            cnt[u] = cnt.get(u, 0) + 1
            for v in u.neighbors:
                if v not in cnt:   # v has not been counted at a shallower level
                    queue.append(v)
        queue = queue[L:]
    return [(v, delta) for (v, delta) in cnt.items()]

This algorithm is correct because all the nodes with the same shortest-path length are on the same level of the BFS tree, so by counting the number of times a node appears in the shallowest level where it appears, we count the number of shortest paths to that node. Because only shortest paths are considered, each edge is visited once and only once, and the sum of the number of times the nodes are visited is O(m). Therefore, this algorithm is O(m) (and consequently O(m + n)).

Problem 3: Reaching the one [15 pts]
Let n ≥ 3 be a given positive integer and we are going to play a game as follows. During the game, we have a number k which is initially equal to 2. In each round of the game, we are allowed to change the current number k to a new number in one of the following ways:
• Type-A: Change k to (k + 1) mod n.
• Type-B: Change k to a divisor of k that is greater than 1.
• Type-C: Change k to k² mod n.
Note that the above three rules guarantee that our number k is always in the range {0,1,…,n−1} during the entire game. The game terminates when k becomes 1. Our goal is to finish the game in the fewest rounds.
For example, suppose n = 91 and we can play the game as follows.
• Round 1: 2 ⇒ 3 using a Type-A change.
• Round 2: 3 ⇒ 9 using a Type-C change.
• Round 3: 9 ⇒ 81 using a Type-C change.
• Round 4: 81 ⇒ 27 using a Type-B change.
• Round 5: 27 ⇒ 1 using a Type-C change.
As you can verify, this is an optimal solution. Design an algorithm Game(n) which returns an optimal solution to the game as a list. So Game(91) should return [2,3,9,81,27,1] (or another solution with 5 rounds). Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n log n) time. (Hint: Formulate the problem as a graph problem with n vertices and O(n log n) edges.)

Solution. We construct a graph based on the problem setting: all the possible remainders modulo n, namely 0,1,…,n−1, become the n vertices. For Type-A operations, we connect vertex k to vertex k+1 for k = 0,1,…,n−2, and connect n−1 to 0. For Type-B operations, we connect vertex k to all of its divisors larger than 1. For Type-C operations, we connect vertex k to k² mod n for k = 2,…,n−1.
Next, we bound the total number of directed edges. For Type-A, the number of directed edges is n; for Type-B, the total number of edges is at most the number of (multiple, divisor) pairs, which is Σ over d = 2,…,n−1 of ⌊n/d⌋ ≤ n · Σ over d = 2,…,n of 1/d; for Type-C, the number of edges is n − 2. By the integral test, Σ over d = 2,…,n of 1/d ≤ ln(n) + 1, so the Type-B edges number O(n log n). Since n itself is O(n log n), the total number of directed edges is O(n log n).
Next, we perform a BFS on this graph G, using a queue to maintain the BFS frontier. We first append the starting vertex 2 to the queue. Each time, for the first vertex in the queue: if it is 1, we stop the BFS; otherwise we append all of its unvisited neighbors to the end of the queue and record the current vertex as the previous vertex of each such neighbor (if that neighbor does not yet have a recorded previous vertex). This always finds a shortest path between 2 and 1, because the nearest vertices always reach the head of the queue first, so when 1 is reached the recorded path is guaranteed to be shortest. The BFS runs in O(n log n) time because the graph has O(n log n) edges and each edge is visited at most once; constructing the graph also takes O(n log n) time since each edge takes O(1) time to construct. Therefore, the overall complexity of the algorithm is O(n log n). The pseudocode is given as follows:

class Vertex:
    def __init__(self, val):
        self.val = val
        self.neighbors = []
    def add_neighbor(self, other):
        self.neighbors.append(other)

def GAME(n):
    V = [Vertex(i) for i in range(n)]
    # Type-A edges: k -> (k + 1) mod n
    for i in range(n - 1):
        V[i].add_neighbor(V[i + 1])
    V[n - 1].add_neighbor(V[0])
    # Type-B edges: k -> each divisor of k greater than 1
    for i in range(2, n):
        for j in range(2 * i, n, i):
            V[j].add_neighbor(V[i])
    # Type-C edges: k -> k^2 mod n (skip self-loops)
    for i in range(2, n):
        if i * i % n != i:
            V[i].add_neighbor(V[i * i % n])
    # BFS from vertex 2
    visited = [0] * n
    prev = [-1] * n
    queue = [2]
    while queue:
        head = queue[0]
        queue = queue[1:]
        visited[head] = 1
        if head == 1:
            break
        for nb in V[head].neighbors:
            if not visited[nb.val]:
                queue.append(nb.val)
                if prev[nb.val] == -1:
                    prev[nb.val] = head
    # reconstruct the path from 2 to 1
    sol = [1]
    while sol[-1] != -1:
        sol.
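As a quick sanity check, the cleaned-up pseudocode above can reproduce the worked example from the problem statement (ties in the BFS may yield a different but equally short path):

# n = 91: one optimal play is 2 -> 3 -> 9 -> 81 -> 27 -> 1 (5 rounds)
path = GAME(91)
print(path)            # e.g. [2, 3, 9, 81, 27, 1]
print(len(path) - 1)   # number of rounds: 5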
append(prev[sol[-1]])
    return sol[-2::-1]

Problem 4: Finding a strongly connected component [10 pts]
Given a directed graph G = (V,E) and a vertex s ∈ V, we want to compute the strongly connected component of G containing s. Design an algorithm SCC(G,s) which returns the set of vertices of G that lie in the same strongly connected component as s. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n + m) time, where n = |V| and m = |E|. (Hint: By definition, a vertex v is in the same strongly connected component as s iff v is reachable from s and s is reachable from v. We already know how to find the vertices reachable from s using DFS. Can you use a similar idea to compute the vertices from which s is reachable?)

Solution. For each node u, we store all the nodes v with (u,v) ∈ E as its out-neighbors and all the nodes w with (w,u) ∈ E as its in-neighbors. We then perform a DFS along out-neighbors to obtain all the nodes that s can reach, and a DFS along in-neighbors to obtain all the nodes that can reach s. Finally, we take the intersection of the two results to obtain the set of vertices in the strongly connected component containing s.
This algorithm is correct: a DFS from s along out-neighbors reaches a node t iff t is reachable from s. Following in-neighbors is equivalent to reversing the directions of all edges and performing DFS again, so a node t is reached iff there is a path of reversed edges from s to t, i.e., a path from t to s in the original graph, meaning s is reachable from t. By taking the intersection of the results of the two DFS's, a node t is kept iff it is reachable from s and it can reach s, i.e., it is strongly connected to s. The time to process in-neighbors and out-neighbors is O(m); the time complexity of each DFS is O(n + m). Therefore, the time complexity of the whole algorithm is O(n + m). The pseudocode is as follows:

class Node:
    def __init__(self, val):
        self.val = val
        self.neighbors = [[], []]        # neighbors[0]: in-neighbors, neighbors[1]: out-neighbors
    def add_neighbor(self, other):       # add an edge from self to other
        self.neighbors[1].append(other)
        other.neighbors[0].append(self)

def SCC(G, s):
    (V, E) = G
    for (u, v) in E:
        u.add_neighbor(v)
    # res[0]: all nodes that can reach s; res[1]: all nodes reachable from s
    res = [set(), set()]
    def DFS(u, direction=0):             # direction 0: follow in-neighbors, 1: follow out-neighbors
        res[direction].add(u)
        for v in u.neighbors[direction]:
            if v not in res[direction]:
                DFS(v, direction)
    DFS(s, 0)
    DFS(s, 1)
    # s itself lies in its own strongly connected component, so it is kept
    return res[0].intersection(res[1])
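For illustration, a minimal driver for the routine above; the vertex names and edges are made up for this example (a three-vertex cycle plus one extra vertex):

# a -> b -> c -> a forms a cycle; d hangs off c, so it is not in a's component
a, b, c, d = Node('a'), Node('b'), Node('c'), Node('d')
V = [a, b, c, d]
E = [(a, b), (b, c), (c, a), (c, d)]
print(sorted(x.val for x in SCC((V, E), a)))   # expected: ['a', 'b', 'c']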
CSC718 Operating Sys & Parallel Programming
Homework 4: MPI, OpenMP, and GPU Programming Assignments
Total: 100 points

The homework assignment will be graded based on the following criteria:
• Accuracy: 1) the solution meets the specific requirements in the problem description; 2) the solution produces correct results; 3) the procedures adopted in the solution are technically sound.
• Efficiency: efficiency will be one of the criteria when grading programming assignments. The solution should produce the desired results efficiently.
• Effort/neatness: the solution includes excellent effort, and all related work is shown neatly and organized well.

Homework assignment feedback will be available through the DropBox folder on D2L. For all the programming assignments, you can choose any operating system you want. I will usually provide C/C++ samples for the programming assignments. If you prefer to use other languages, e.g., Java, they are accepted too. A README.txt is required when submitting any programming assignment. In the README.txt, you need to provide the following information: 1) How to compile your program? 2) How to run your program? 3) What is the average running time of your program? 4) What are the expected output results when I run your program? Zip all your source code, project files, supporting files, and README.txt and submit the all-in-one zip file to the D2L Dropbox. If you have any questions about the homework, please let me know.

(10 points) What are the strengths and limitations of GPU programming?
(15 points) Among the four parallel frameworks we discussed in class (multithreaded programming, MPI programming, OpenMP programming, and GPU programming, e.g., CUDA), what strategies would you use, in general, when selecting a parallel framework for your applications?
(10 points) Benchmarking of a sequential program reveals that 95 percent of the execution time is spent inside functions that are amenable to parallelization. What is the maximum speedup we could expect from executing a parallel version of this program on 10 processors, using Amdahl's Law?

The "acceptable" identifier must meet the following constraints:
• The identifier is a six-digit combination of the numerals 0-9
1. (15 points) Write a sequential program to count the different identifiers given these constraints.
2. (20 points) Convert the sequential program to an MPI parallel program.
3. (20 points) (choose only one: A or B, not both)
   A. Convert the sequential program to an OpenMP parallel program.
   B. Convert the sequential program to a GPU program using a parallel framework of your choice (e.g., CUDA, OpenCL, etc.)
4. (10 points) Benchmark the three programs based on your choice in 3).

Problem                        1 process / 1 thread   2 processes / 2 threads   3 processes / 3 threads   4 processes / 4 threads
Program1.c (sequential)                                n/a                       n/a                       n/a
Program2.c (MPI)
Program3.c (OpenMP or CUDA)
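As a reference for the Amdahl's-Law question above, a small illustrative helper (written in Python for brevity; the function name and the sample numbers are mine, not part of the assignment):

def amdahl_speedup(parallel_fraction, processors):
    # Amdahl's Law: S = 1 / ((1 - f) + f / p), where f is the fraction of the
    # execution time that can be parallelized and p is the number of processors
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)

# example with made-up numbers: a half-parallelizable program on 4 processors
print(amdahl_speedup(0.5, 4))   # 1.6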
CS771A Assignment 3
The Boys
Rajarshi Dutta 200762, Udvas Basak 201056, Shivam Pandey 200938

1 Question 1
The first part of the solution uses a linear model, Elastic Net regression, which combines the Ridge regression penalty (L2 loss) and the Lasso regression penalty (L1 loss). The objective/loss function of the model is

C(w) = ‖y − Xw‖² + α (λ1 ‖w‖₂² + λ2 ‖w‖₁),   ŵ = argmin_w C(w)    (1)

where,
• w = weights multiplied with each of the features present in the dataset taken into account for the construction of the regression model.
• α = constant that multiplies both the lasso and ridge penalties.
• λ1 = regularization parameter for the Ridge regression penalty.
• λ2 = regularization parameter for the Lasso regression penalty.

During regularization, the Lasso part of the penalty leads to the formation of a sparse model, whereas the quadratic Ridge part makes the regularization path more stable, decreases the number of variables to be selected, and promotes the grouping effect. The grouping effect enables correlated variables to be identified together, which improves the sampling procedure. The method first finds the Ridge regression coefficients and then performs a second step with a Lasso-style shrinkage of the coefficients.

In order to determine the best performing linear model, Randomized Search Cross Validation has been applied for efficient hyperparameter selection over the following sets of hyperparameters of the model. Here grid is a dictionary whose keys are the hyperparameters of the Elastic Net regression model and whose values are the candidate settings:

grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 0.2, 1.0]
grid['l1_ratio'] = np.arange(0, 1, 0.01)
grid['selection'] = ['cyclic', 'random']
grid['max_iter'] = [2, 10, 100, 1000]
grid['tol'] = [1e-4, 1e-3, 1e-5, 1e-2, 1e-1, 0.0]

The Randomized Search Cross Validation returns the best estimator with the following hyperparameters:

ElasticNet(alpha=0.1, l1_ratio=0.62, selection='random', tol=0.001, warm_start=True)

The corresponding MAE scores on the training data are 5.6248 (for the O3 levels) and 6.5136 (for the NO2 levels) respectively.

2 Question 2
A suitable non-linear model which can effectively fit the given dataset is the K Nearest Neighbours (KNN) algorithm. KNN can be used both as a classification and a regression model; it predicts for unknown data points based on the distance to the k nearest data points. The model can work with several distance metrics, namely the Euclidean, Minkowski and Manhattan distances. One strength of the KNN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships. In the case of regression tasks, the algorithm chooses the k nearest points based on the given distance metric and sets the prediction for the new point to approximately the average of these nearest points. A distance-weighted k-nearest-neighbour rule can also be used, in which a neighbour's weight decreases as the sample-to-neighbour distance increases. (2)

Here, the optimal value of k has been calculated by plotting the train MAE loss and test MAE loss for both the O3 and NO2 sensor values.
The MAE loss variation for different k values is shown below.
Figure 1: MAE loss variation for k values

We find that the KNN regressor provides the minimum mean absolute error for a k value of 5.

model = KNeighborsRegressor(n_neighbors=5, p=1, algorithm="auto", n_jobs=-1)

Some other models tested for the given regression problem include XGBoost, Multi-layer Perceptron and Random Forest regressor models.
• XGBoost Regressor: XGBoost uses a combination of two techniques, gradient boosting and tree boosting, to improve the accuracy of the model. Gradient boosting involves minimizing the loss function (mean absolute error) by iteratively adding weak learners (e.g., decision trees) to the model. Tree boosting, on the other hand, involves optimizing the split points of the decision trees by minimizing the loss function.
• Random Forest Regressor: The algorithm works by creating a large number of decision trees, each of which is trained on a subset of the data and a subset of the features. The final prediction is then made by averaging the predictions of all the trees in the forest. Random forest regression also includes several regularization techniques to prevent overfitting and improve the generalization of the model. These techniques include maximum depth, minimum samples per leaf, and maximum features.
• Artificial Neural Network: The network consists of layers of 64, 32 and 8 units. The optimizer used is Adam, the kernel initialisation is He uniform, and the loss function used is the mean absolute error loss. The network is trained with backpropagation for around 100 epochs on the training data with a batch size of 64.

Non-linear model            O3 MAE loss   NO2 MAE loss
K Neighbours Regressor      3.3284        2.3228
XGBoost Regressor           5.1433        4.7230
Random Forest Regressor     6.0784        5.3145
MultiLayer Perceptron       4.8807        4.9178
Table 1: Stats of different non-linear models

3 Question 3
The code is available in the submit.py file. A code snippet is given below:

import numpy as np
import pickle as pkl

# Define your prediction method here
# df is a dataframe containing timestamps, weather data and potentials
def my_predict(df):
    X = np.array(df.iloc[:, 1:])
    # Load your model file
    model = pkl.load(open("knn_model.pkl", "rb"))
    test_preds = np.transpose(model.predict(X))
    # Make two sets of predictions, one for O3 and another for NO2
    pred_o3, pred_no2 = test_preds[0], test_preds[1]
    # Return both sets of predictions
    return (pred_o3, pred_no2)
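For context, a minimal sketch of how the pickled knn_model.pkl loaded above might be produced. The dataframe layout, feature names and synthetic data below are assumptions for illustration only; this is not the submitted training code.

import numpy as np
import pandas as pd
import pickle as pkl
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in for the real training dataframe: the first column is a
# timestamp and the remaining columns are features (names are made up here)
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "timestamp": np.arange(200),
    **{f"feat{i}": rng.normal(size=200) for i in range(5)},
})
y_train = rng.normal(size=(200, 2))    # two targets side by side: O3 and NO2 (synthetic here)

X_train = np.array(train_df.iloc[:, 1:])
model = KNeighborsRegressor(n_neighbors=5, p=1, algorithm="auto", n_jobs=-1)
model.fit(X_train, y_train)            # a single multi-output fit covers both targets
pkl.dump(model, open("knn_model.pkl", "wb"))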
CS771A Assignment 2
The Boys
Rajarshi Dutta 200762, Udvas Basak 201056, Shivam Pandey 200938

1 Solution 1
1.1 Theory
Figure 1: Illustration of a working ID3 algorithm

We have used the Iterative Dichotomiser 3 (ID3) decision tree algorithm, one of the most widely used algorithms for supervised machine learning classification tasks. The algorithm works by recursively partitioning the training data based on their attributes until the decision tree produces the purest splits (referred to as the leaves). ID3 uses a top-down greedy approach to build a decision tree: the tree is constructed from the top, and the greedy approach means that at each iteration the best feature is selected to split the node.

1.2 Our approach
The decision tree constructed to solve the given Wordle-Solvr problem consists of the following components:

1.2.1 process_node
• For the process_node function, every non-root node present in the constructed ID3 tree is considered and passed through the get_entropy function, which calculates the index of the vocabulary entry that minimizes the entropy the most, i.e., maximizes the information gain. This index can be used to access the most appropriate query to use at that step.
• All words from the all_words list are extracted, and the reveal function uncovers the mask between the query and the required word, which is stored in split_dict as the key. The value in split_dict is an array of indices that return the given mask when queried.

get_entropy
• The get_entropy function iterates over all of the words present in that node (possible candidate queries) and passes the array to the function calc_entropy, which returns the entropy of the current node.
• Another dictionary is maintained which, given the mask, gives the number of queries that produce that particular mask when queried with the word. These values are used to calculate the weighted sum of entropies after splitting the node.
• The difference between the initial entropy calculated at the parent node and the weighted sum of entropies after splitting the node gives us the information gain used to produce the best split out of all words:

Gain(S, A) = E(S) − Σ over v ∈ V of (|Sv| / |S|) · E(Sv)    (1)

where V = possible values of attribute A, S = set of examples X, Sv = subset where XA = v.
• Information gain indirectly describes the mutual information between the attribute and the class labels of S.

calc_entropy
• This function simply calculates the Shannon entropy

E(S) = − Σ over class labels of p · log2(p)    (2)

where p represents the probability of each class label which can be produced by the corresponding split based on their attribute.

2 Solution 2
The entire algorithm has been implemented in the submit.py file.
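To make the entropy/information-gain step concrete, here is a small illustrative sketch. It is not the submitted code: the reveal callback and the uniform-probability assumption are simplifications, and the helper names merely mirror those described above.

import math

def calc_entropy(words):
    # Shannon entropy of a node, assuming every remaining word is equally likely
    n = len(words)
    return math.log2(n) if n > 1 else 0.0

def get_entropy(words, candidates, reveal):
    # For each candidate query, split the node's words by the mask that `reveal`
    # returns, then compute the size-weighted average entropy of the children.
    n = len(words)
    expected = {}
    for q in candidates:
        split = {}
        for w in words:
            split.setdefault(reveal(q, w), []).append(w)
        expected[q] = sum(len(g) / n * calc_entropy(g) for g in split.values())
    return expected

def best_query(words, candidates, reveal):
    # The query with the lowest expected remaining entropy maximizes information
    # gain, since the parent entropy is the same for every candidate.
    expected = get_entropy(words, candidates, reveal)
    return min(expected, key=expected.get)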
Write your name and section number at the top right of this page: Class Section Number Bret 8:50 001 Sahifa 1:20 003 Miranda 9:55 004 Sahifa 3:30 005 Sahifa 8:50 006 Cameron 1:20 007 As you complete the exam, write your initials at the top right of each other page.If you need more room, there is a blank page at the end of the exam, or we can give you some scratch paper. Some multiple choice questions are “Select ONE” while others are “Select ALL that apply”. Pay attention to the question type and only mark one option if it says “Select ONE”. Fill in the circles completely. 1. A dataframe lions has 32 rows, each representing a unique lion. It has two numeric columns: age , each lion’s age in decimal years, and proportion.black , the percentage of that lion’s nose which is black. Finally, it has a logical column adult , which is TRUE if age is greater than 3 and FALSE otherwise. Consider the following code which attempts to make a scatter plot of age on the X axis vs. proportion.black on the y axis.Which of the following statements are true about errors in the above code? Select ALL that apply. fill = adult must be changed to color = adult to color the points differently. alpha = 0.5 must be moved outside of aes() , like size = 2 currently is. size = 2 must be moved within aes() , like alpha = 0.5 is. fill = adult must be moved to the aes() in the ggplot() call, not geom point() . 2. Using the same dataframe as Question 1, consider the following code which creates a new dataframe, lions summary .Which of the following statements are true about lions summary ? Select ALL that apply. lions summary has 2 rows. lions summary has 2 columns. (False: adult, averageAge, and averagePropBlack.) lions summary has a column called ‘adult‘. If there were any NA values in the age column of lions , all the values of averageAge in lions summary will be NA (Sneakily false: If there are only NA values in one category of adult, the other can still function!) 3. A random viewer’s ideal movie length, in minutes, is approximately X ∼ N(120,15). Actual movie lengths are given by Y ∼ N(143,19). (a) Which R code below calculates the probability that a random movie is shorter than the 40th percentile for ideal movie length? Select ONE. qnorm(0.4, 120, 15) %>% pnorm(143, 19) qnorm(0.4, 143, 19) %>% pnorm(120, 15) pnorm(0.4, 120, 15) %>% qnorm(143, 19) pnorm(0.4, 143, 19) %>% qnorm(120, 15) (b) What movie length y corresponds to a z-score of 1? Select ONE. 19143 124162 (143 + 19 = 162) 4. Your friend claims that a random variable, X, follows the following probability distribution:Which of the following are true about your friend’s claim? Select ONE. This distribution has negative values of x, therefore X is not a random variable. This distribution has negative values of P(X = x), therefore X is not a random variable. The area under this distribution is not 1, therefore X is not a random variable. X is a valid random variable. 5. Consider ordering a ”footlong” (12 inch) sub from Subway. Consider trying to estimate µ, the average length of the sub you receive when you order a 12 inch sub. You do this by ordering 40 footlong subs and measuring their lengths in inches. (What a great problem this is.) You observe that the average length of your 40 subs is 11.5 inches, and the standard deviation of their lengths is 0.4 inches. Consider testing H0 : µ = 12 vs. Hα : µ < 12. Write a numeric expression (only containing numbers and artihmetic symbols like + and ×) for the test statistic of this test. x¯ − µnull √ sx/ n Answer:6. 
The fictitious distribution of the number of children in a household (X) is given below. x 0 1 2 3 P(X = x) ? 0.4 0.25 0.15 (a) Which of the following statements about X is true? Select ONE. X is a discrete RV. X is a binomial RV. X is a continuous RV. X is a normal RV. (b) What value of P(X = 0) makes this a valid probability distribution? Select ONE. 0.1 0.15 0.2 0.25 (c) What is the median number of children per household? Select ONE. 0 1 2 Not enough infomration to determine 7. Data is collected on the duration each winter that a lake’s surface is frozen over a period of 103 consecutive years. The correlation coefficient between the variables is r = −0.4. The year variable has a mean ¯x = 1950 and a standard deviation sx = 30. The freeze duration variable has ¯y = 90 and a standard deviation sy = 17. Consider fitting a simple linear regression model to this data. Write a numerical expression (using only numbers and arithmetic symbols like + or ×) for the predicted freeze duration in the year 1890. No need to simplify. Answer: (1890 is two x standard deviations BELOW the mean of x, so our predicted value will be r * two y standard deviations below the mean of y.)* 90 + (−0.4) × (−2) × 17 *You could also take the long way:* and βˆ0 = 90 − βˆ1 ∗ 1950, so ˆy1 = βˆ0 + βˆ1 ∗ 1890 8. We continue to consider the lake freezing data from Problem 7. A 95% confidence interval for the freeze duration of the true regression line at x = 1890 has the form: yˆ1 ± c × SEC A 95% prediction interval for the duration that the lake was frozen in the single year 1890 has the form yˆ1 ± c × SEP where c is a critical value from some distribution and SEC, SEP are some positive values. Reminder: There are 103 years in the dataset. Which R code calculates the numeric critical value c? Select ONE. qnorm(0.95) qnorm(0.975) qt(0.95, 101) qt(0.975, 101) qt(0.95, 102) qt(0.975, 102) 9. Using the expressions from the previous problem, circle the true relationship between SEC and SEP and briefly explain why. SEC < SEP SEC = SEP SEC > SEP Answer: A confidence interval for the height of the regression line is narrower than a prediction interval for at the same x point; but they have the same center and critical value. Therefore, it must be that SEC < SEP. 10. Consider trying to estimate the proportion of UW-Madison undergraduate students who will graduate at the end of this semester, p. You ask 50 random students, and 7 of them will graduate at the end of the semester. Consider using this information to calculate a confidence interval for p. Which of the following statements are true? Select ALL that apply. If we were to calculate an Agresti-Coull confidence interval for p, its center would be (7 + 1)/(50 + 2). (False: It would be (7 + 2)/(50 + 4).) If we were to calculate a Wald confidence interval p, its center would be 7/50. The upper bound of the Agresti-Coull confidence interval would be greater than the upper bound of the Wald confidence interval. (it’s shifted towards 0.5) As we increase the confidence level of our interval towards 100%, our interval would widen and approach [0, 1]. 11. We are interested in comparing the proportions of individuals with bachelor degrees among of adult men aged 25 and older in the states of Minnesota and Wisconsin. Let pM and pW represent these two population proportions. In random samples of nM = 300 Minnesota men and nW = 400 Wisconsin men aged 25 and older, the numbers of individuals with bachelor degrees are xM = 90 and xW = 100. 
Without simplification, write a numerical expression using provided data for the standard error SEci used in a 95% confidence interval for pM − pW of the form (point estimate) ± zcrit × SEci when using the Agresti-Coull method. Answer: SEci 12. In the setting of the problem above, write a numerical expression using provided data for the standard error SEht of the test statistic Z: pˆM − pˆW Z = ∼ N(0,1) SEht Answer: SEht 13. Suppose that the test statistic from the previous problem has the value z = 1.47 and that the p-value from a two-sided test is 0.14. Which of the following statements are true? Select ALL that apply. The test is statistically significant at an α = 0.05 level. A 95% confidence interval for pM − pW will contain the value 0. ( insignificant p-value) 14. You ask a group of 30 people from Country A to each privately give you a random number between 1 and 1000. You then ask a group of 40 people from Country B to each privately give you a random number between 1 and 1000. Let µA be the true average value that all members of country A would give upon this request, and similarly for µB. You are interested in the quantity µA − µB. Which of the following statements are true about how to approach this problem through statistical inference? Select ONE. Two-sample means inference is impossible to conduct because the sample sizes are different. Two-sample means inference is impossible to conduct because the random variables are technically discrete. Two-sample inference is more appropriate than paired-sample inference in this case. Paired-sample inference is more appropriate than two-sample inference in this case. 15. Assume the correct value of the test statistic from problem 14 is -50. Which R code correctly calculates the p-value for this test? Select ONE. pt(-50, df = W) 2*pt(-50, df = W) 2*pt(abs(-50), df = W) 1 – pt(-50, df = W)
Vikranth Udandarao 2022570 1. To compare the manual convolution with the result from np.convolve(), both were computed and plotted on the same graph. The manual convolution was performed using nested loops to apply the convolution sum directly, while np.convolve() used NumPy’s built-in convolution function, which is optimized for performance. In the plot, we differentiate the two methods by using distinct colors and line widths: the manual convolution is shown with a thicker blue line, and the np.convolve() result is depicted with a thinner red line. Both methods yield nearly identical results, which visually overlap on the plot, indicating that the manual approach correctly implements the convolution process.2.3. . Prominent Frequency Components: ○ The FFT magnitude spectrum clearly shows two primary peaks: ■ One at 5 Hz. ■ The second at 12 Hz. ○ These frequencies correspond exactly to the frequencies f1=5 Hz and f2=12 Hz that were used to construct the signal in the time domain. Amplitudes of the Peaks: ○ The peak at 5 Hz has a magnitude of approximately 32. This is consistent with the fact that the cosine term has an amplitude of 1. In the FFT, the magnitude of this peak is scaled by the number of samples NNN, which results in a larger peak. ○ The peak at 12 Hz has a magnitude of approximately 16. This corresponds to the second cosine term where the amplitude is 0.5. The FFT shows a peak half as large as the one at 5 Hz, which matches the expected amplitude scaling. Symmetry of the Spectrum: ○ The full FFT result (first plot) is symmetric around 0 Hz, with peaks appearing at both positive and negative frequencies. This is a direct consequence of the fact that the signal x[n]x[n]x[n] is real-valued, leading to conjugate symmetry in the frequency domain. ○ The second plot focuses on the positive frequencies, showing only the significant frequency components at 5 Hz and 12 Hz.4. …The Gaussian pulse x(t) is a smoothly varying signal in the time domain. Its magnitude spectrum, obtained through the Fast Fourier Transform (FFT), reveals several important characteristics about the frequency content of the pulse. Gaussian Shape in the Frequency Domain: a. The FFT of a Gaussian pulse results in another Gaussian-shaped magnitude spectrum in the frequency domain. This property arises because the Fourier Transform of a Gaussian function is also a Gaussian. The spectrum is centered at 0 Hz, reflecting that the pulse is a low-frequency signal with most of its energy concentrated around zero frequency. Dominance of Low-Frequency Components: b. The bell-shaped spectrum shows that the energy of the Gaussian pulse is primarily concentrated at low frequencies. The amplitude decreases rapidly as we move away from 0 Hz. This indicates that the Gaussian pulse is composed mainly of low-frequency components, which corresponds to its smooth and continuous nature in the time domain. Width of the Frequency Spectrum: c. The width of the magnitude spectrum depends on the parameter σsigmaσ, which controls the width of the Gaussian pulse in the time domain. In this case, with σ=0.1sigma = 0.1σ=0.1, the spectrum is relatively wide in the frequency domain. According to the time-frequency uncertainty principle, a narrower pulse in the time domain (small σsigmaσ) results in a broader frequency spectrum. d. If σsigmaσ were larger, the time-domain pulse would be wider, and the corresponding frequency spectrum would be narrower, concentrating more energy near 0 Hz. Interpretation of Frequency Components: e. 
The central peak around 0 Hz indicates the strong presence of DC (low-frequency) components, while the rapid decay of the spectrum indicates minimal contributions from higher frequencies. This reflects that the Gaussian pulse is smooth and doesn’t exhibit rapid changes, which would require higher frequency components. 5.FIR Filter (Finite Impulse Response) 1. Time Domain Behavior: ○ The FIR filter attenuates the high-frequency component at 100 Hz while retaining the 10 Hz component, as seen in the FIR-filtered signal plot. 2. Frequency Domain Behavior: ○ In the frequency domain, the 100 Hz component is clearly attenuated in the FFT of the FIR-filtered signal, while the 10 Hz component is preserved. However, the attenuation at 100 Hz is not complete, and some leakage is visible. ○ The frequency response of the FIR filter shows a gradual roll-off after the 50 Hz cutoff frequency, and as a result, the filter does not entirely remove components near the cutoff, which is a common issue in finite-length FIR filters. This makes the filtering less sharp than with an IIR filter. IIR Filter (Infinite Impulse Response) 1. Time Domain Behavior: ○ The IIR filter effectively removes the high-frequency 100 Hz component and retains the 10 Hz component with minimal distortion. Compared to the FIR filter, the IIR filter introduces significantly less delay and closely matches the low-frequency content of the original signal. ○ The IIR filter achieves this with fewer coefficients, making it computationally more efficient than the FIR filter. However, recursive IIR filters can introduce phase shifts, although using zero-phase filtering (filtfilt) here minimizes this issue. 2. Frequency Domain Behavior: ○ The FFT of the IIR-filtered signal shows complete suppression of the 100 Hz component, while the 10 Hz component is preserved cleanly, indicating the IIR filter’s sharper frequency cutoff. ○ The frequency response of the IIR filter shows a much steeper roll-off after 50 Hz, which results in better attenuation of higher frequencies compared to the FIR filter. The IIR filter performs better in suppressing unwanted frequencies while retaining the desired low-frequency components. Comparison of FIR and IIR Filters: 1. Attenuation Efficiency: ○ Both the FIR and IIR filters effectively suppress the high-frequency component at 100 Hz, but the IIR filter achieves sharper attenuation, as evidenced by the frequency response and FFT plots. ○ The FIR filter’s gradual roll-off leads to some residual presence of the 100 Hz component in the frequency spectrum, whereas the IIR filter eliminates it more effectively. 2. Delay and Phase Distortion: ○ The IIR filter, being recursive, introduces minimal delay while still suppressing the high-frequency content, making it more suitable for real-time applications that require lower latency. Phase distortion can be a concern in IIR filters, but the use of zero-phase filtering (filtfilt) avoids this. 3. Computational Efficiency: ○ The FIR filter requires more coefficients and thus more computational resources to achieve similar performance compared to the IIR filter. ○ The IIR filter is more computationally efficient, as it uses fewer coefficients to achieve a sharper cutoff and better attenuation of high frequencies.
This assignment has in total 55 base points and 10 extra points, and the cap is 55. Bonus questions are indicated using the ⋆ mark. Please specify the following information before submission: • Your Name: Yufeng Xu • Your NetID: yx3038

Problem 1: Infinite knapsack [10⋆ + 15 pts]
(a)⋆ Prove that in any optimal solution (n1,…,nm), the total number of copies of the items I2,…,Im included in the knapsack is at most wmax − 1, i.e., n2 + ⋯ + nm ≤ wmax − 1.
(b) Based on the conclusion of (a), design an algorithm Knapsack(m, W, (w1,v1),…,(wm,vm)) for infinite knapsack with running time O(m · wmax²). For convenience, your algorithm only needs to return the maximum value achieved. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. (Hint: According to (a), we know that, in an optimal solution, the total weight of the items I2,…,Im is at most wmax(wmax − 1) < wmax², and hence the item I1 occupies "most" of the space of the knapsack. Try to combine this observation with the O(mW)-time DP algorithm.)

Solution. Please write down your solution to Problem 1 here.

(a) We prove the claim by contradiction. Suppose there exists an optimal solution in which n2 + ⋯ + nm ≥ wmax. We denote the weights of these items as w′1, w′2, …, w′n, where n ≥ wmax, and for each w′j, ∃ i ∈ {2,…,m} such that w′j = wi. There can be multiple w′j's that are equal to the same wi. Now, we want to show that ∃ 1 ≤ l ≤ r ≤ n such that w′l + ⋯ + w′r is a multiple of w1. Consider the sum of the first k w′j's, denoted Sk = w′1 + ⋯ + w′k, and write Sk ≡ Rk (mod w1), where Rk ∈ {0, 1, …, w1 − 1}.
(i) If ∃ k such that Rk = 0, then take l = 1 and r = k; Sk is a multiple of w1.
(ii) If ∄ k such that Rk = 0, then Rk can only take w1 − 1 possible values. Because n ≥ wmax ≥ w1 > w1 − 1, by the pigeonhole principle ∃ k1 < k2 such that Rk1 = Rk2, hence Sk2 − Sk1 ≡ 0 (mod w1). We can then take l = k1 + 1 and r = k2, so that w′l + ⋯ + w′r = t·w1 for some positive integer t.
Therefore, we can replace w′l, …, w′r with t copies of I1, and because I1 has the highest value density v1/w1, the final value is going to be larger, which is contradictory to our assumption that the original solution is optimal. Therefore n2 + ⋯ + nm ≤ wmax − 1.

(b) We know from (a) that n2 + ⋯ + nm ≤ wmax − 1, hence the total weight of the items I2,…,Im is at most wmax(wmax − 1) < wmax². Consider first the naive infinite-knapsack algorithm. Let DP(i, w) denote the maximum value obtainable by choosing from items {i, i + 1, …, m} with total weight at most w. Because we can either take at least one copy of the i-th item or take none of them, DP(i, w) = max{ DP(i + 1, w), vi + DP(i, w − wi) }. Because i = 1, 2, …, m and w = 0, 1, …, W, this algorithm has time complexity O(mW). Now, with the conclusion from (a), we can limit the search space of w for items i = 2, …, m to {0, 1, …, wmax²}. The global maximum is then equal to maxw { DP(2, w) + ⌊(W − w)/w1⌋ · v1 }. For items 2, …, m, given any total weight limit w, the infinite-knapsack DP gives the optimal value of the items chosen. For the whole problem, we learned from (a) that the total weight of items 2, …, m is at most wmax², so we can limit their total weight to wmax² and simply take as many copies of item 1 as possible in the weight limit we have not used. By using this method, we obtain the correct optimal solution. Because the weight limit for items 2, …, m shrinks from W to wmax², and it takes O(1) time to add the copies of item 1 to a partial solution, the time complexity of the revised algorithm is O(m · wmax²). Below is the pseudocode:

    from math import floor

    def KNAPSACK(m, W, items):  # items is a list of (w, v) pairs
        # Sort so that items[0] is the item with the highest value density v/w (this is I1).
        items = sorted(items, key=lambda x: x[1] / x[0], reverse=True)
        w_max = max(w for w, v in items)
        B = min(w_max ** 2, W)  # weight budget reserved for items I2, ..., Im (part (a))
        # DP[idx][w] = best value using items idx..m-1 only, with total weight at most w.
        # m + 1 rows are allocated to handle the boundary case idx = m.
        DP = [[0] * (B + 1) for _ in range(m + 1)]
        for idx in range(m - 1, 0, -1):
            w_i, v_i = items[idx]
            for w in range(B + 1):
                if w >= w_i:
                    DP[idx][w] = max(v_i + DP[idx][w - w_i], DP[idx + 1][w])
                else:
                    DP[idx][w] = DP[idx + 1][w]
        w1, v1 = items[0]
        # Fill the unused capacity with as many copies of the densest item I1 as possible.
        return max(DP[1][w] + floor((W - w) / w1) * v1 for w in range(B + 1))
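A quick sanity check of the sketch above on a toy instance (the numbers are made up purely for illustration, not part of the assignment):

    # Hypothetical instance: W = 7 with items (w, v) = (2, 3) and (3, 4).
    # The best packing is two copies of (2, 3) plus one copy of (3, 4), for value 10.
    print(KNAPSACK(2, 7, [(2, 3), (3, 4)]))  # expected output: 10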
Problem 2: Counting shortest paths [15 pts]
Given an undirected graph G = (V, E) and a vertex s ∈ V, we want to know, for every vertex t ∈ V, how many (distinct) shortest paths from s to t there are. This can be viewed as a generalization of the unique shortest-path problem. Design an algorithm Count(G, s) to solve this problem. The algorithm should return a list L that contains a pair (v, δv) for each vertex v of G, where δv is the number of shortest paths from s to v. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n + m) time, where n = |V| and m = |E|.
A sample example. Suppose the input graph is G = (V, E) where V = {s, a, b, c, d, e} and E = {(a,b), (b,c), (c,d), (a,d), (s,a), (c,e)}; see the figure below. Then the algorithm Count(G, s) should return L = [(s,1), (a,1), (b,1), (c,2), (d,1), (e,2)]. (Hint: Exploit the BFS tree and extend the idea in the unique shortest-path problem.)

Solution. Please write down your solution to Problem 2 here.
We solve the problem by BFS. Consider a BFS tree of the graph: we view s as the root and perform a level-order traversal. For each node, we count the number of times it appears at the level where it is first discovered; this count is the number of shortest paths to it. For the neighbors of a node, if a neighbor has not yet been discovered (or was discovered on the next level), we append it to the queue that maintains the BFS; otherwise we ignore it, because it was already counted at a shallower (or the same) level, and its appearance at the current level does not correspond to a shortest path. The pseudocode is as follows:

    class Node:
        def __init__(self):
            self.neighbors = []

        def add_neighbor(self, other):  # undirected edge
            self.neighbors.append(other)
            other.neighbors.append(self)

    def COUNT(G, s):
        (V, E) = G  # V: list of Node; E: list of tuples of 2 Nodes
        for (u, v) in E:
            u.add_neighbor(v)
        cnt = {}
        level = {s: 0}  # level at which each vertex is first discovered
        queue = [s]
        d = 0
        while queue:
            L = len(queue)
            for u in queue[:L]:
                cnt[u] = cnt.get(u, 0) + 1
                for v in u.neighbors:
                    if v not in level:          # first discovery: v belongs to level d + 1
                        level[v] = d + 1
                        queue.append(v)
                    elif level[v] == d + 1:     # rediscovery on the next level: another shortest path
                        queue.append(v)
            queue = queue[L:]
            d += 1
        return [(node, c) for node, c in cnt.items()]

This algorithm is correct because all nodes with the same shortest-path length lie on the same level of the BFS tree, so by counting the number of times a node appears at the level where it is first discovered, we count the number of shortest paths to that node. Because only shortest paths are considered, each edge is visited only between consecutive BFS levels. The sum of the number of times each node is visited is O(m); therefore, this algorithm is O(m) (and consequently O(m + n)).
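For reference, the same BFS idea can also be carried on a distance array, propagating path counts along edges instead of duplicating queue entries; this variant makes the O(n + m) bound immediate. The sketch below is an illustrative reformulation (adjacency lists as plain Python dicts rather than the Node class above), not the submitted code:

    from collections import deque

    def count_shortest_paths(adj, s):
        # adj: dict mapping each vertex to the list of its neighbors (undirected graph).
        dist = {s: 0}
        cnt = {s: 1}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:               # first time v is reached: record its distance
                    dist[v] = dist[u] + 1
                    cnt[v] = cnt[u]
                    q.append(v)
                elif dist[v] == dist[u] + 1:    # another shortest path into v
                    cnt[v] += cnt[u]
        return list(cnt.items())

    # On the sample graph from the problem statement:
    adj = {"s": ["a"], "a": ["b", "d", "s"], "b": ["a", "c"],
           "c": ["b", "d", "e"], "d": ["c", "a"], "e": ["c"]}
    print(count_shortest_paths(adj, "s"))
    # expected counts: s:1, a:1, b:1, c:2, d:1, e:2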
Problem 3: Reaching the one [15 pts]
Let n ≥ 3 be a given positive integer and we are going to play a game as follows. During the game, we have a number k which is initially equal to 2. In each round of the game, we are allowed to change the current number k to a new number in one of the following ways:
• Type-A: Change k to (k + 1) mod n.
• Type-B: Change k to a divisor of k that is greater than 1.
• Type-C: Change k to k² mod n.
Note that the above three rules guarantee that our number k is always in the range {0, 1, …, n−1} during the entire game. The game terminates when k becomes 1. Our goal is to finish the game in the fewest rounds.
For example, suppose n = 91 and we can play the game as follows.
• Round 1: 2 =⇒ 3 using a Type-A change.
• Round 2: 3 =⇒ 9 using a Type-C change.
• Round 3: 9 =⇒ 81 using a Type-C change.
• Round 4: 81 =⇒ 27 using a Type-B change.
• Round 5: 27 =⇒ 1 using a Type-C change.
As you can verify, this is an optimal solution. Design an algorithm Game(n) which returns an optimal solution to the game as a list. So Game(91) should return [2, 3, 9, 81, 27, 1] (or another solution with 5 rounds). Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n log n) time. (Hint: Formulate the problem as a graph problem with n vertices and O(n log n) edges.)

Solution. Please write down your solution to Problem 3 here.
We construct a graph based on the problem setting: all the possible remainders modulo n, i.e., 0, 1, …, n−1, become the n vertices. For the Type-A operation, we connect vertex k to vertex k+1 for k = 0, 1, …, n−2, and connect n−1 to 0. For the Type-B operation, we connect vertex k to all of its divisors larger than 1. For the Type-C operation, we connect vertex k to k² mod n for k = 2, …, n−1.
Next, we consider the total number of directed edges. For Type-A, the number of directed edges is n. For Type-B, each i ∈ {2, …, n−1} receives an edge from every multiple of i below n, so the total number of Type-B edges is at most Σ over i = 2, …, n of n/i; by the integral test, Σ over i = 2, …, n of 1/i ≤ ∫ from 1 to n of dx/x = ln n, so this is at most n ln n = O(n log n). For Type-C, the total number of edges is at most n − 2. Also, n and n − 2 are O(n log n), hence the total number of directed edges is O(n log n).
Next, we perform a BFS on this graph G. We use a queue to maintain the BFS procedure and first append the starting vertex 2 to the queue. Each time, we take the first vertex in the queue; if it is 1, we quit the BFS; otherwise we append all of its unvisited neighbors to the end of the queue, and mark the current vertex as the previous vertex of each such neighbor (if the neighbor does not yet have a recorded previous vertex). This algorithm always finds the shortest path between 2 and 1, because the nearest vertices always come to the head of the queue first, so when 1 is reached the path is guaranteed to be shortest. The time complexity of BFS is O(n log n) because the total number of edges in the graph is O(n log n), and each edge is visited at most once. The time needed to construct the graph is also O(n log n) because each edge takes O(1) time to construct. Therefore, the overall complexity of the algorithm is O(n log n). The pseudocode is given as follows:

    from collections import deque

    class Vertex:
        def __init__(self, val):
            self.val = val
            self.neighbors = []

        def add_neighbor(self, other):
            self.neighbors.append(other)

    def GAME(n):
        V = [Vertex(i) for i in range(n)]
        # Type-A edges: k -> (k + 1) mod n
        for i in range(n - 1):
            V[i].add_neighbor(V[i + 1])
        V[n - 1].add_neighbor(V[0])
        # Type-B edges: k -> each divisor of k greater than 1 (and smaller than k)
        for i in range(2, n):
            for j in range(2 * i, n, i):
                V[j].add_neighbor(V[i])
        # Type-C edges: k -> k^2 mod n
        for i in range(2, n):
            if i ** 2 % n != i:
                V[i].add_neighbor(V[i ** 2 % n])
        # BFS from 2; prev[] records the predecessor on a shortest path.
        visited = [0] * n
        prev = [-1] * n
        queue = deque([2])
        visited[2] = 1
        while queue:
            head = queue.popleft()
            if head == 1:
                break
            for nb in V[head].neighbors:
                if not visited[nb.val]:
                    visited[nb.val] = 1
                    prev[nb.val] = head
                    queue.append(nb.val)
        # Reconstruct the path 2 -> ... -> 1 by following prev back from 1.
        sol = [1]
        while sol[-1] != -1:
            sol.append(prev[sol[-1]])
        return sol[-2::-1]
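A quick check against the n = 91 example above (the exact path returned may differ from [2, 3, 9, 81, 27, 1], but it should also take five rounds):

    path = GAME(91)
    print(path)           # one optimal play, e.g. [2, 3, 9, 81, 27, 1]
    print(len(path) - 1)  # number of rounds; expected: 5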
Problem 4: Finding a strongly connected component [10 pts]
Given a directed graph G = (V, E) and a vertex s ∈ V, we want to compute the strongly connected component of G containing s. Design an algorithm SCC(G, s) which returns the set of vertices of G that lie in the same strongly connected component as s. Describe the basic idea and give the pseudocode. Briefly justify the correctness and analyze the time complexity. Your algorithm should run in O(n + m) time, where n = |V| and m = |E|. (Hint: By definition, a vertex v is in the same strongly connected component as s iff v is reachable from s and s is reachable from v. We already know how to find the vertices reachable from s using DFS. Can you use a similar idea to compute the vertices from which s is reachable?)

Solution. Please write down your solution to Problem 4 here.
For each node u, we store all the nodes v with (u, v) ∈ E as its out-neighbors and all the nodes w with (w, u) ∈ E as its in-neighbors. We then perform a DFS over out-neighbors to obtain all the nodes that s can reach, and a DFS over in-neighbors to obtain all the nodes that can reach s. Lastly, we take the intersection of the two results, which is the set of vertices in the strongly connected component containing s.
This algorithm is correct because DFS from s reaches a node t iff t is reachable from s. If we reverse the directions of all edges and perform DFS again, a node t is reached iff there is a path of reversed edges from s to t, i.e., a path from t to s in the original graph, so s is reachable from t. By taking the intersection of the results of the two DFS traversals, a node t is kept iff it is reachable from s and it can reach s, i.e., it is strongly connected to s. The time complexity to process in-neighbors and out-neighbors is O(m). The time complexity of each DFS is O(n + m). Therefore, the time complexity of the whole algorithm is O(n + m). The pseudocode is as follows:

    class Node:
        def __init__(self, val):
            self.val = val
            self.neighbors = [[], []]  # neighbors[0]: in-neighbors, neighbors[1]: out-neighbors

        def add_neighbor(self, other):  # add an edge from self to other
            self.neighbors[1].append(other)
            other.neighbors[0].append(self)

    def SCC(G, s):
        (V, E) = G
        for (u, v) in E:
            u.add_neighbor(v)
        res = [set(), set()]  # res[0]: nodes that can reach s; res[1]: nodes reachable from s

        def DFS(u, direction):  # direction 0 follows in-neighbors, 1 follows out-neighbors
            res[direction].add(u)
            for v in u.neighbors[direction]:
                if v not in res[direction]:
                    DFS(v, direction)

        DFS(s, 0)
        DFS(s, 1)
        ans = res[0].intersection(res[1])
        ans.remove(s)
        return ans
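A small hypothetical usage of the sketch above (the vertex values and edges are invented for illustration): the cycle 0 -> 1 -> 2 -> 0 forms the component of s = 0, while vertex 3 is only reachable from it.

    a, b, c, d = Node(0), Node(1), Node(2), Node(3)
    G = ([a, b, c, d], [(a, b), (b, c), (c, a), (a, d)])
    print(sorted(v.val for v in SCC(G, a)))  # expected: [1, 2] (s itself is excluded by the code)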
CSC718 Operating Sys & Parallel Programming
Homework 4: MPI, OpenMP, and GPU Programming Assignments
Total: 100 points

The homework assignment will be graded based on the following criteria:
• Accuracy: 1) the solution meets the specific requirements in the problem description; 2) the solution produces correct results; 3) the procedures adopted in the solution are technically sound.
• Efficiency: efficiency will be one of the criteria when grading programming assignments. The solution should produce the desired results efficiently.
• Effort/neatness: the solution shows excellent effort, and all related work is shown neatly and is well organized.

Homework assignment feedback will be available through the DropBox folder on D2L. For all the programming assignments, you can choose any operating system you want. I will usually provide C/C++ samples for the programming assignments. If you prefer to use other languages, e.g., Java, they are accepted too. A README.txt is required for any programming assignment. In the README.txt, you need to provide the following information: 1) How to compile your program? 2) How to run your program? 3) What is the average running time of your program? 4) What are the expected output results when I run your program? Zip all your source code, project files, supporting files, and README.txt, and submit the all-in-one zip file to the D2L Dropbox. If you have any questions about the homework, please let me know.

(10 points) What are the strengths and limitations of GPU programming?

(15 points) Among the four parallel frameworks we discussed in class, multithreaded programming, MPI programming, OpenMP programming, and GPU programming (e.g., CUDA), what strategies would you use in general when selecting a parallel framework for your applications?

(10 points) Benchmarking of a sequential program reveals that 95 percent of the execution time is spent inside functions that are amenable to parallelization. What is the maximum speedup we could expect from executing a parallel version of this program on 10 processors, using Amdahl's Law?

The "acceptable" identifier must meet the following constraints:
• The identifier is a six-digit combination of the numerals 0-9
1. (15 points) Write a sequential program to count the different identifiers given these constraints.
2. (20 points) Convert the sequential program to an MPI parallel program.
3. (20 points) (choose only one: A or B, not both)
   A. Convert the sequential program to an OpenMP parallel program.
   B. Convert the sequential program to a GPU program using a parallel framework of your choice (e.g., CUDA, OpenCL, etc.)
4. (10 points) Benchmark the three programs based on your choice in 3).

Problem                       1 process / 1 thread    2 processes / 2 threads    3 processes / 3 threads    4 processes / 4 threads
Program1.c (sequential)                                n/a                        n/a                        n/a
Program2.c (MPI)
Program3.c (OpenMP or CUDA)
CS771A Assignment 3
The Boys
Rajarshi Dutta (200762), Udvas Basak (201056), Shivam Pandey (200938)

1 Question 1

The first part of the solution uses a linear model, Elastic Net regression, which takes into account both the ridge regression penalty (L2 loss) and the lasso regression penalty (L1 loss). The objective/loss function of the model is

    C(w) = argmin_w || y − Xw ||^2 + α ( λ1 ||w||_2^2 + λ2 ||w||_1 )    (1)

where
• w = the weights multiplied with each of the features of the dataset used to construct the regression model,
• α = a constant that multiplies both the lasso and ridge penalties,
• λ1 = the regularization parameter for the ridge regression penalty,
• λ2 = the regularization parameter for the lasso regression penalty.

During regularization, the lasso (L1) part of the penalty drives the model towards sparsity, decreasing the number of variables that are selected, whereas the quadratic ridge (L2) part makes the regularization path more stable and promotes the grouping effect. The grouping effect means that strongly correlated variables tend to be selected together, which makes them easy to identify and improves the selection procedure. The (naive) elastic net can be viewed as first finding ridge regression coefficients and then applying a lasso-style shrinkage to them.

In order to determine the best-performing linear model, randomized-search cross-validation has been applied for efficient hyperparameter selection over the following sets of hyperparameters. Here grid is a dictionary whose keys are the hyperparameters of the Elastic Net regression model and whose values are the candidate settings.

    grid = dict()
    grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 0.2, 1.0]
    grid['l1_ratio'] = np.arange(0, 1, 0.01)
    grid['selection'] = ['cyclic', 'random']
    grid['max_iter'] = [2, 10, 100, 1000]
    grid['tol'] = [1e-4, 1e-3, 1e-5, 1e-2, 1e-1, 0.0]

The randomized-search cross-validation returns the best estimator with the following hyperparameters:

    ElasticNet(alpha=0.1, l1_ratio=0.62, selection='random', tol=0.001, warm_start=True)

The corresponding MAE scores on the training data are 5.6248 (for the O3 levels) and 6.5136 (for the NO2 levels) respectively.
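The grid above could be plugged into scikit-learn's RandomizedSearchCV roughly as follows; this is a sketch, where the search budget (n_iter), the MAE-based scoring choice, the 5-fold CV, and the placeholder training arrays X_train / y_train are assumptions for illustration rather than the exact settings of the submitted notebook.

    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import RandomizedSearchCV

    grid = {
        'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 0.2, 1.0],
        'l1_ratio': np.arange(0, 1, 0.01),
        'selection': ['cyclic', 'random'],
        'max_iter': [2, 10, 100, 1000],
        'tol': [1e-4, 1e-3, 1e-5, 1e-2, 1e-1, 0.0],
    }
    search = RandomizedSearchCV(
        ElasticNet(warm_start=True),
        param_distributions=grid,
        n_iter=100,                          # illustrative search budget
        scoring='neg_mean_absolute_error',   # matches the MAE metric reported above
        cv=5,
        random_state=0,
    )
    # search.fit(X_train, y_train)   # one search per target (O3 and NO2)
    # print(search.best_estimator_, -search.best_score_)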
2 Question 2

A suitable non-linear model which can effectively fit the given dataset is the K-nearest-neighbours (KNN) algorithm. KNN can be used as both a classification and a regression model; it predicts for an unknown data point based on its k nearest data points. The model can work with several distance metrics, namely the Euclidean, Minkowski, and Manhattan distances. One strength of the KNN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships. In a regression task, the algorithm chooses the k nearest points under the given distance metric and predicts the new point as approximately the average of these nearest points. A distance-weighted k-nearest-neighbour rule is used here, in which each neighbour's weight decreases as the sample-to-neighbour distance increases:

    y_hat(x) = ( Σ over i = 1..k of w_i · y_i ) / ( Σ over i = 1..k of w_i ),  with  w_i = 1 / d(x, x_i)    (2)

The optimal value of k has been determined by plotting the train MAE loss and the test MAE loss for both the O3 and NO2 sensor values. The MAE loss variation for different k values is shown below.

Figure 1: MAE loss variation for K values

We find that the KNN regressor provides the minimum mean absolute error for a k value of 5.

    model = KNeighborsRegressor(n_neighbors=5, p=1, algorithm="auto", n_jobs=-1)

Some other models tested for the given regression problem include XGBoost, multi-layer perceptron, and random forest regressor models.
• XGBoost Regressor: XGBoost uses a combination of two techniques, gradient boosting and tree boosting, to improve the accuracy of the model. Gradient boosting minimizes the loss function (mean absolute error) by iteratively adding weak learners (e.g., decision trees) to the model. Tree boosting, on the other hand, optimizes the split points of the decision trees by minimizing the loss function.
• Random Forest Regressor: The algorithm works by creating a large number of decision trees, each of which is trained on a subset of the data and a subset of the features. The final prediction is then made by averaging the predictions of all the trees in the forest. Random forest regression also includes several regularization techniques to prevent overfitting and improve the generalization of the model, including maximum depth, minimum samples per leaf, and maximum features.
• Artificial Neural Network: The network consists of three hidden layers with 64, 32, and 8 units respectively. The optimizer used is Adam, the kernel initialisation is He uniform, and the loss function is the mean absolute error loss. The network is trained for around 100 epochs on the training data with a batch size of 64.

Non-linear model            O3 MAE loss    NO2 MAE loss
K Neighbours Regressor      3.3284         2.3228
XGBoost Regressor           5.1433         4.7230
Random Forest Regressor     6.0784         5.3145
MultiLayer Perceptron       4.8807         4.9178
Table 1: Stats of different non-linear models

3 Question 3

The code is available in the submit.py file. A code snippet is given below:

    import numpy as np
    import pickle as pkl

    # Define your prediction method here
    # df is a dataframe containing timestamps, weather data and potentials
    def my_predict(df):
        X = np.array(df.iloc[:, 1:])
        # Load your model file
        model = pkl.load(open("knn_model.pkl", "rb"))
        test_preds = np.transpose(model.predict(X))
        # Make two sets of predictions, one for O3 and another for NO2
        pred_o3, pred_no2 = test_preds[0], test_preds[1]
        # Return both sets of predictions
        return (pred_o3, pred_no2)
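The k selection described in Question 2 can be reproduced with a sweep like the sketch below; the range of k and the variable names (X_train, y_train, X_test, y_test) are illustrative assumptions rather than the exact notebook code.

    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_absolute_error

    def mae_vs_k(X_train, y_train, X_test, y_test, k_values=range(1, 21)):
        # Returns (k, train MAE, test MAE) triples for each candidate k.
        results = []
        for k in k_values:
            model = KNeighborsRegressor(n_neighbors=k, p=1, n_jobs=-1)
            model.fit(X_train, y_train)
            train_mae = mean_absolute_error(y_train, model.predict(X_train))
            test_mae = mean_absolute_error(y_test, model.predict(X_test))
            results.append((k, train_mae, test_mae))
        return results

    # The k with the smallest test MAE (reported as k = 5 above) is then used for the final model.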
CS771A Assignment 2
The Boys
Rajarshi Dutta (200762), Udvas Basak (201056), Shivam Pandey (200938)

1 Solution 1

1.1 Theory

Figure 1: Illustration of a working ID3 algorithm

We have used the Iterative Dichotomiser 3 (ID3) decision tree algorithm, which is one of the most widely used algorithms for supervised machine-learning classification tasks. The algorithm works by recursively partitioning the training data based on their attributes until the decision tree produces the purest splits (referred to as the leaves). ID3 uses a top-down greedy approach to build the decision tree: the tree is constructed from the top, and the greedy approach means that at each iteration the best feature is selected to split the node.

1.2 Our approach

The decision tree constructed to solve the given Wordle-Solvr problem consists of the following components:

1.2.1 process_node
• For the process_node function, every non-root node of the constructed ID3 tree is considered and passed through the get_entropy function, which finds the index in the vocabulary that minimizes the expected entropy, i.e., maximizes the information gain. This index is used to access the most appropriate query at that step.
• All words are extracted from the all_words list, and the reveal function uncovers the mask between the query and the required word; the mask is stored as a key in split_dict. The value of split_dict contains an array of the indices of the words that return the given mask when queried.

get_entropy
• The get_entropy function iterates over all the words present in that node (the possible candidate queries) and passes the array to the function calc_entropy, which returns the entropy of the current node.
• Another dictionary is maintained which, given a mask, records the number of candidate words that produce that particular mask when queried with the word. These values are used to calculate the weighted sum of entropies after splitting the node.
• The difference between the initial entropy calculated at the parent node and the weighted sum of entropies after splitting gives us the information gain, which is used to produce the best split out of all words:

    Gain(S, A) = E(S) − Σ over v ∈ Values(A) of ( |S_v| / |S| ) · E(S_v)    (1)

  where Values(A) = the possible values of attribute A, S = the set of examples X, and S_v = the subset where X_A = v.
• Information gain indirectly describes the mutual information between the attribute and the class labels of S.

calc_entropy
• This function simply calculates the Shannon entropy

    E(S) = − Σ over i of p_i · log2(p_i)    (2)

  where p_i represents the probability of each class label that can be produced by the corresponding split based on its attribute.

2 Solution 2

The entire algorithm has been implemented in the submit.py file.
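As an illustration of the split criterion described above, the sketch below computes the expected entropy of splitting a candidate word list by the reveal mask of a query; the reveal function, the word list, and all names are simplified placeholders for illustration and are not the submitted submit.py code.

    from collections import Counter
    from math import log2

    def reveal(query, answer):
        # Simplified Wordle-style mask: 'g' = right letter in the right place,
        # 'y' = letter present elsewhere, 'x' = letter absent (duplicates not handled).
        return "".join(
            "g" if q == a else ("y" if q in answer else "x")
            for q, a in zip(query, answer)
        )

    def entropy(n_items):
        # Shannon entropy of a uniform distribution over n_items candidate answers.
        return log2(n_items) if n_items > 0 else 0.0

    def expected_entropy_after(query, candidates):
        # Split the candidate answers into buckets by the mask the query would reveal,
        # then take the size-weighted average entropy of the buckets.
        buckets = Counter(reveal(query, answer) for answer in candidates)
        total = len(candidates)
        return sum((size / total) * entropy(size) for size in buckets.values())

    # The best query minimizes the expected entropy, i.e., maximizes the information
    # gain relative to the parent entropy entropy(len(candidates)).
    candidates = ["crane", "slate", "trace", "brine", "pride"]
    best = min(candidates, key=lambda q: expected_entropy_after(q, candidates))
    print(best)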