This exercise extends the use of GPIOs and employs the DAC and the buzzer (speaker) to generate observable audio outputs. You will also practice the use of a timer (TIM), interrupts and, finally, DMA. The lab will be done in two parts, with the second part building on the success of the first.

This exercise relies on the previous laboratory exercises, the classes and tutorials, and it focuses on the use of basic hardware blocks within the processor. In addition to consulting the class notes, you should consult the processor documentation to complete this exercise. Some specific hints will be given in the tutorial and the lectures leading up to this lab exercise.

STM32L4+ processors have multiple timers, described in detail in Sec. 38 of the STM32L4+ Reference Manual, available on MyCourses. These timers can generate a variety of signals and interrupts, and they can also trigger DMA transfers.

You will need to use Cube MX to configure a few GPIO pins for analog output. Since the DAC converts digital register values (i.e., integers) into analog values (i.e., voltages), we will use that signal to drive a speaker with an oscillating signal. You will drive two different signals on two different DAC output channels: a saw wave and a triangle wave, with as similar a frequency as possible.

There are two basic paths to discovering which pins must be configured for the on-board DAC. The first, cumbersome way is through the manuals, so we will ignore it here. The better path is to use MX. In the Pinout & Configuration tab, MX summarizes many of the features of the chip on the left-hand side, under categories such as System Core, Analog, Timers, etc. Under Analog, choose DAC1. Enabling OUT1 and OUT2 will automatically enable the correct pins in the appropriate mode.

To configure the DAC, find DAC1 under Analog in the list of features under Pinout & Configuration.
If you haven't already, in DAC1 Mode and Configuration, enable OUT1 and OUT2 in Connected to external pin only mode. Then, verify the DAC Out1 Settings and DAC Out2 Settings:
• Output Buffer (Enable)
• Trigger (None)
• User Trimming (Factory trimming)
• Sample And Hold (Sample and hold Disabled)

Step 1: Making Signals
In Lab 2, you read the state of a button and wrote to an LED (besides using the ADC). In this lab, you will initialize and write to the DAC to generate signals in an audible frequency range, so that we can observe the system operation with a small speaker. The button and an LED will be used a bit differently this time.

You should first implement code that manually generates two signals: a triangle wave and a saw(tooth) wave. Without interrupts, it is difficult to time these signals precisely. However, do your best to generate oscillating signals with a period of ~15 ms (corresponding to ~65 Hz, or note C2). You will be shown below how to observe the generated signals with the debugger before sending them to the DAC.

Next, assign each signal to a different DAC output channel. To initialize the DAC and write data to it, you'll need more HAL functions. Sections 16.2.3 and 16.2.4 of the HAL Driver User Manual list the functions you will need; they are detailed in Section 16.2.7.

Note that the DAC can operate with either 8-bit (0 to 255) or 12-bit (0 to 4095) precision. You make this choice with parameters passed to the HAL driver. Recall that 8- and 16-bit integer data types are available (uint8_t and uint16_t), and using them may simplify your implementation. Further, note that HAL_Delay(…) can be used to insert a delay between operations in your code. As a reminder, the details of its usage can be found in the HAL Driver User Manual.

If you generate your code and return to the IDE, DAC1 should be configured.
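Before writing the C version, it can help to sanity-check the waveform arithmetic offline. The sketch below is Python purely for experimentation (the lab itself is written in C); the sample count N and the 8-bit full-scale value are illustrative assumptions, not requirements:

```python
# Illustrative waveform tables for an 8-bit DAC (0..255).
# With roughly one HAL_Delay(1) (1 ms) between samples, N samples span
# about N ms, so N = 16 lands near the ~15 ms target period.
N = 16          # samples per period (hypothetical choice)
FULL = 255      # 8-bit DAC full scale

# Sawtooth: linear ramp from 0 to FULL, then wrap around.
saw = [round(i * FULL / (N - 1)) for i in range(N)]

# Triangle: ramp up over the first half-period, back down over the second.
triangle = [round(2 * FULL * i / N) if i <= N // 2
            else round(2 * FULL * (N - i) / N)
            for i in range(N)]
```

The same index arithmetic carries over directly to a C array of uint8_t values written to the DAC in a loop.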
(Hopefully you also remembered to write your own code within the USER CODE BEGIN and USER CODE END regions, so it is all still there!)

Take note of whether any light on the board (LED1?) blinks while this part of your project is running. Can you explain why?

Step 2: Making Sounds
When the waveforms look right, wire up a speaker using the components available to you. Note that: (1) Your board has the same external interface as an Arduino (A0-A5 and D0-D15, plus others). (2) Your speaker is different, but it fits into the breadboard with the indicated spacing. (3) The resistor is placed in series with the speaker to limit the current and protect both devices.

Step 3: Making Better Sounds
How do the triangle and saw waves sound? Not great. Do they have the desired frequency? Not really, though we can't fix this without using timers and interrupts (later!). Next, generate a signal with approximately the same period as above, but using the arm_sin_f32() function in the DSP library (similar to Lab 1). As before, trace the values before driving the speaker.

Useful Notes: The Debugger Use in Step 1 Above
While developing your code, you will spend substantial time using the debugger. Before you test your code with a speaker, use the ITM interface to verify that it is working as intended. Ensure that the Serial Wire Viewer (SWV) is enabled and configured appropriately in the debugger configuration. Since we'll use the ITM's data trace functionality this time, no code modifications are required (e.g., to timestamp events). Start the debugger. Once it pauses execution at the first line of main, ensure that the SWV Data Trace Timeline Graph is visible; find it under the Window > Show View > SWV pull-down menu.

Before resuming execution, you need to configure (wrench icon) and then start recording (red button). Configure the Serial Wire Viewer to enable Comparator 0 and Comparator 1, and write the names of the variables you wish to monitor in Var/Addr.
In my case, the variables that hold the current signal values are triangle and saw. If you try to specify variables for tracing while they are out of scope (e.g., you pause and the code stops inside a library), you may get a warning indicating Variable not found! Tracing will not work properly unless you configure the comparators while the variables are in scope.

When you resume execution (don't forget to record), if everything is working properly, the data trace will rapidly fill with oscillating signals. Note that at our target frequency you may have to zoom in a bit in order to distinguish your triangle and saw waves. You will likely observe that the periods of the two signals are not exactly the same. We'll achieve more precise timing in later labs when we use interrupts.

In this part, you will improve the quality of the output by using a timer to control the rate of writing to the DAC. The timer can generate an interrupt to execute a special function, an interrupt handler. Interrupts tend to be more efficient than polling (which is how we've interacted with the button) or using HAL_Delay(…) (which is how we've interacted with the DAC), and they give us greater control over timing, which is essential for a wide variety of applications. We'll first use an interrupt to detect when the button has been pressed. We'll then use a timer, and its periodic interrupt, to determine when to write new data to the DAC. Finally, we'll use the timer and direct memory access (DMA) to write to the DAC; in this last case, sending values to the DAC will be handled almost entirely by hardware, leaving the processor free for other tasks.

Push Button
Configure the push button to enable external interrupts. In the Pinout & Configuration tab, on the left, select GPIO. Under Configuration, select NVIC, and enable EXTI line[15:10] interrupts.
This means that an interrupt will be generated whenever there is a signal change on the external interrupt lines; your button should be wired to one interrupt line, and the corresponding code will be written for that interrupt to affect the LED, as explained on the next page.

DAC
For Part 2, modify the DAC to use only one channel. (Remember to check the schematic; you'll be using your speaker again, and it will not work when wired to the wrong output.)

Timer
The timer will help update the DAC output at regular intervals. What's an appropriate interval? CD-quality audio is sampled (and reproduced) at 44.1 kHz. Voice-call audio can be sampled at lower rates. Choose a sampling rate (e.g., 44.1 kHz). Given your system clock frequency (e.g., 80 MHz), calculate the counter period (the maximum value of the counter) to achieve this sampling rate. In this lab, the timer pre-scaler is not necessary. Finally, under Parameter Settings, set the Trigger Event Selection TRGO to Update Event, and under NVIC Settings, enable the TIM2 global interrupt. Together, these settings ensure that (a) when the timer elapses, execution in main() is interrupted; and (b) the callback function (defined below) is executed.

Step 1: Implementing Push-button Interrupts
An interrupt is a signal (internal or external) that prompts the processor to stop normal execution (e.g., in main()) and begin executing an interrupt service routine (ISR), or handler, a function responding to the interrupt event. In Lab 2, the code polled (checked over and over again) for changes in the push button signal; in this lab, you will write a function that is executed whenever the push button interrupt occurs, such that the LED shows the value of the button (1/0).

Section 31.2 of the HAL Driver User Manual details the functions used to interact with GPIO. What we're interested in, in particular, is HAL_GPIO_EXTI_Callback(…).
This function is called by the GPIO external interrupt handler, and we can control what it does in main.c simply by writing a new definition; our new function is automatically used instead of the weakly defined original. Write this function in main.c (be sure to respect the function prototype defined in the HAL manual) so that it toggles the LED. This function takes as an argument the pin that caused the interrupt; it's good programming practice to verify that the interrupt was caused by the pin we think it was. This isn't essential for our lab, since there are no other external interrupts, but it is necessary when a single callback function may need to handle various interrupt sources.

Note: remember again to put your code in a USER CODE region so that it doesn't disappear when we go back to MX to modify our configuration. We'll be using the push button again later.

Step 2: Implementing Timer-driven DAC Output
Now, write a callback function for the timer. Section 72.2 of the HAL Driver User Manual details the functions used to interact with timers. You are particularly interested in two sets of functions: the TIM Base functions and the TIM Callback functions. You want to start the timer WITH global interrupts enabled, i.e., in interrupt mode. Read the function definitions carefully, so that you start your timer in the correct mode. (Yes, you need to call a function to start the timer; don't forget to do so, as this is an otherwise very frustrating problem to debug.)

Just like for the button, HAL_TIM_PeriodElapsedCallback(…) is called by the TIM interrupt handler. Write a new definition for it in main.c. Again, it's good programming practice to verify that the timer causing the interrupt (an argument passed to your function) is actually the one you want to respond to. In this function, you'll send a new value to the DAC (see Part 1 and Section 16.2 of the HAL Driver User Manual).
You cannot pass the value as an argument, because you don't call this function yourself; it is called asynchronously in an entirely hardware-controlled process, and its only argument is the timer that caused the interrupt. What you can do, however, is put the DAC values in a global variable (defined outside of any function, like other variables in main.c). You don't have control over when the timer elapses and the callback is called; you need to prepare all the DAC values in advance and save them in global variables that can be accessed by the callback function.

In main(…), write code to populate an array with a sine wave (use the ARM math library). You can "play" this wave on your speaker (using the same circuit as in Lab 2). To get the best possible results:
• Pick a wave frequency in the 1-2 kHz range (~C6-C7 in music parlance). Lower frequencies are harder to hear; higher frequencies too, depending on your hearing abilities.
• Note that the timer frequency is (and must be) higher than that of the signal you want to drive; how do you ensure that your desired frequency is realized? (Nyquist to the rescue.)
• Note that the number of saved samples matters; if you save samples for anything other than 2πn radians (a whole number of cycles), you will have a discontinuity from the end to the beginning of the array, causing distortion.
• Scale your DAC values so they vary over about 2/3 of the possible dynamic range. The chip will dynamically clamp GPIO outputs to prevent damage, limiting their current to 20 mA. If you attempt to use the full range of DAC output, the signal will look fine under high impedance (e.g., with a voltmeter or pocket oscilloscope), but will clip when connected to the speaker, causing distortion.

Using a global array defining the values to be sent to the DAC, write your implementation of the timer callback so that it sends a new value from this array to the DAC each time it is called.

Next, change the code to use direct memory access (DMA).
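The constraints above (a whole number of cycles in the table, a timer rate above the signal rate, ~2/3 amplitude) and the earlier counter-period calculation can be checked numerically. This is a Python sketch for offline checking only; the lab code uses C with arm_sin_f32(), the values of N and CYCLES are hypothetical choices, and the off-by-one counting convention for the timer should be verified against the reference manual:

```python
import math

F_CLK = 80_000_000    # example system clock (Hz)
F_SAMPLE = 44_100     # example DAC update rate (Hz)

# Timer counter period for the chosen sampling rate (no prescaler);
# the counter runs from 0 up to this reload value.
counter_period = round(F_CLK / F_SAMPLE) - 1

N = 32                # table length (hypothetical)
CYCLES = 1            # whole sine cycles stored -> no end-to-start discontinuity
FULL = 255            # 8-bit DAC full scale

# Signal frequency that results from replaying the table at F_SAMPLE:
f_signal = F_SAMPLE * CYCLES / N   # should land in the 1-2 kHz range

# Sine table centred at mid-scale, spanning about 2/3 of the dynamic range:
amp = FULL * (2 / 3) / 2
table = [round(FULL / 2 + amp * math.sin(2 * math.pi * CYCLES * i / N))
         for i in range(N)]
```

If the table length or cycle count changes, f_signal moves accordingly; pick N and CYCLES so that f_signal lands in the desired range.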
DMA uses an on-chip peripheral that can be programmed to perform memory accesses. In this case, DMA will read our array of sine values and write it to the DAC for us. This means that we no longer need to execute code in the timer interrupt callback, saving CPU cycles for other tasks (if we had any) or reducing power.

To use DMA, you need to reconfigure the DAC. Instead of using our timer to trigger a callback that sets the DAC value, we'll use our timer to trigger the DAC itself. Return to MX. The first thing we need to change, then, is to select the appropriate trigger in Parameter Settings. Under Trigger, choose the trigger out event corresponding to your timer.

Next, we need to set up DMA. Go to DMA Settings, and add a DMA request. Choose Circular mode; this means the DAC will repeatedly read from the array, starting over from the beginning when the end is reached. Normal mode implies that the array would be read and transferred only once. Choose the appropriate data width for your software; e.g., I've used 8-bit resolution for my DAC and a uint8_t array for my sine waves, and therefore want DMA to transfer bytes.

Now regenerate your code. Comment out or otherwise disable your timer callback; it is no longer needed. In fact, the global interrupt for your timer isn't necessary at all and can be disabled (though it won't hurt anything). The last thing to do is change how you start the DAC, to start it in DMA mode (Section 16.2 of the HAL Driver User Manual).

Step 4: Putting it All Together and Multiple Tone Generation
Finally, combine the functionality into something more sophisticated. Expand your code so that when the button is pushed, the tone played on the speaker changes. Select at least three different tones; an arpeggio (e.g., C6, E6, G6) would suit the purpose well, but anything else is fine, too. Use interrupts and DMA.

Experimental Results to Demo
You are asked to reach the following milestones.
Grading
• C implementation of signals (triangle, saw, sine) in Part 1: 10%
• Visualization of signals (triangle, saw, sine) in Part 1: 10%
• Audible confirmation of signals (triangle, saw, sine) in Part 1: 10%
• Push-button interrupt: 10%
• Timer interrupt for driving the DAC: 10%
• DMA driving the data: 20%
• Multiple-tone audio generation: 10%
• Working demo organization and success: 20%

Final Report
Once you have all the parts working, include all the relevant data in your report. The report should concisely explain your solution to the problem given, including the final code. You should use the established 2-column IEEE format. Please capture screenshots and relevant code snippets, and include them in the Appendix. All code should be well documented. Any performance evaluation and correctness validation should be apparent from your written report.

Due Dates
The first two labs will be completed in several phases, over three weeks. First, you should take time to understand the lab and ask any questions in the regular lab sessions or through the discussion groups. The first lab demonstration will be on Mar. 8-10th, by which time you should have solved Part 1 and be able to demo and explain how you approached the exercise. Please note that you will be asked to cycle through 3 different signal shapes. The final demonstration, in which all the parts (interrupts, timer, DMA) are put together, will be on Mar. 15th and 17th and will include showing your source code and demonstrating a working program for all test cases. The final report will be due on Friday, Mar. 18th.
You developed a complete C/assembly program for the STM32L4+ processor in the previous laboratory exercise. This lab exercise will teach you to use the General-Purpose Input/Output (GPIO) pins of the processor, as well as some of the built-in blocks that track the physical status of the processor. There are two readily available physical parameters that the STM32L4+ processor can read: its core temperature and its voltage.

The program that you are asked to build will switch between updating the readout of the temperature and of the voltage each time the blue button on your development kit is pressed. The two variables expressing these physical values will be observable by means of "watched variables", which the processor and the IDE allow you to observe during ongoing processor operation.

This exercise relies on the previous laboratory exercise and all the classes and tutorials, and it focuses on the use of basic hardware blocks within the processor. In addition to consulting the class notes, you should consult the processor documentation to complete this exercise. Some specific hints will be given in Tutorial 2 and the lectures leading up to this lab exercise.

You saw in class that GPIO pins can be programmed to perform digital bit input reading, digital pin output, various special functions, as well as analog reading/writing. You will exercise the first two capabilities in this lab.

The simplest GPIO operation is outputting a zero or one, and you can test this operation by driving the green LED on your board. From Section 6.12 of the board user manual, available on MyCourses, you will see that pin PB14 drives LED2. To drive that pin, you have to enable it in Cube MX (started by double-clicking the .ioc file of your project) and declare it as an output (by selecting System Core->GPIO, and then the pin PB14). The tool will allow you to label this pin with a useful mnemonic, as well as set some other parameters (e.g., pull-up or -down).
Consider carefully the suitable values for those parameters by consulting the board schematic and basic electric-circuit knowledge. Similarly, you can locate the blue button pin and configure it as an input. Again, consult the circuit schematic to deduce the default state (i.e., when the button is not pressed) and the potential need for pull resistors in the pin configuration.

STM32L4+ processors are equipped with flexible, high-performance ADCs, described in detail in Sec. 21.2 of the STM32L4+ Reference Manual, available on MyCourses. Commonly, an ADC reads analog quantities via external pins, but it is simpler to first focus on reading out a) the internal reference voltage and b) the internal temperature sensor. The simplest way to obtain a correct value from an ADC is to start a conversion and then read out the value after the conversion has completed, and we will use this method.

You will perform this exercise in several stages and record the results obtained at each step in your lab report.

Step 1
First, write a main C program that detects the pressing of the button and tests the LED2 display by toggling its value every time the button is pressed. It is perfectly acceptable to have this program run in the infinite loop generated by Cube MX after you have configured the two GPIO pins. Take note of the commands added for initializing the GPIOs, as they illustrate the way the Cube MX GUI generates correct code for your program.

Step 2
Configure ADC1 to read out the reference voltage and augment your code to convert the value read into a correct voltage value. After generating the code from Cube MX, please take note of the control structures instantiated for ADC1 and the values generated (especially the assignment of the reference voltage as the source for ADC conversion).
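The raw-count-to-physical conversions needed in Steps 2 and 3 boil down to simple arithmetic. The sketch below is Python for illustration only, and all raw readings and calibration constants are made-up placeholders; the real calibration values are read from the factory calibration addresses listed in the datasheet, and the exact formulas should be checked against the reference manual:

```python
# Hypothetical 12-bit readings and factory calibration values (placeholders!).
VREFINT_CAL = 1655      # factory reading of the internal reference at VDDA = 3.0 V
VREFINT_DATA = 1520     # our ADC reading of the internal reference
TS_CAL1, TS_CAL2 = 1034, 1373   # factory temp-sensor readings at 30 C and 130 C
TS_DATA = 960           # our ADC reading of the temperature sensor

# Supply/reference voltage recovered from the internal reference reading:
vdda = 3.0 * VREFINT_CAL / VREFINT_DATA

# Temperature by linear interpolation between the two calibration points
# (the raw reading is first rescaled to the 3.0 V calibration conditions):
temp_c = (130 - 30) / (TS_CAL2 - TS_CAL1) * (TS_DATA * vdda / 3.0 - TS_CAL1) + 30
```

In the lab code, the same two expressions would be computed in C on the values returned by the ADC read functions.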
You can test the obtained values by watching that variable in your code; there is no need to interrupt the execution.

Step 3
Similarly to Step 2, configure ADC1 to read out the temperature, and then add a correct conversion into your program variable, such that the value of that variable is the temperature in degrees Celsius.

Step 4
This is the "detective" part of your exercise, where you will compare the code obtained in Steps 2 and 3, such that your program control can switch the readouts between the voltage and the temperature. It helps to consult the reference manual to find out how to correctly initialize the readings "on the fly" to avoid errors in the procedure.

Step 5
Produce the final program that alternates between the ADC readings and prepare it for the demonstration to a TA. The LED light should help in understanding what the processor is doing.

Step 6 – optional
Activate interrupt generation by pressing the blue button, and adjust the code to perform the action associated with the button via interrupts.

Useful Notes
In realizing your code, you are free to use good C coding practices, such as conditional compilation for retargeting to the different execution cases given above. If you start your project by using the board support package (BSP) for your board, the clock and pins will be properly selected for you. It is good practice to check the clock and pin settings nonetheless.

The Debugger
While developing your code, you will spend substantial time using the debugger. Please follow the instructions from your tutorials to ensure proper development and debug practices.

Experimental Results to Report
You are asked to reach the following milestones and include the results in your report.
1. Describe the configuration of pins such that Step 1 runs correctly on the processor,
2. Describe the ADC1 configuration for Step 2,
3. Describe the ADC1 configuration for Step 3,
4. Document the code for programmable reconfiguration of ADC1,
5.
Readable and concise code/pseudo-code for your final program execution,
6. Test the temperature sensor readings by (non-destructive!) heating and cooling of the processor. Fingers pressed on the processor can raise the temperature a bit, and natural convection will reduce it. You can also consider the use of a hair dryer.

Final Report
Once you have all the parts working, include all the relevant data in your report, which will be due on the first day of the week following the Lab 2 demonstration, but no later than 4 days after the demo.

Demonstration
The demonstration includes showing your source code and demonstrating a working program. Your program should be capable of handling a variety of test cases and should flag errors appropriately.

Report
The report should concisely explain your solution to the problem given, including the final code. All code should be well documented. Your report should contain a performance evaluation and correctness validation. More detail on the report will be given out in class.

Due Dates
The first two labs will be completed in several phases, over three weeks. First, you should take time to understand the lab and ask any questions in the regular lab sessions or through the discussion groups. The first lab demonstration will be on Feb. 15-17th, by which time you should be able to explain to the TAs how you approach the exercise. The final demonstration, in which the parts are put together and run on test cases, will be on Feb. 23rd and 24th and will include showing your source code and demonstrating a working program for all test cases that we will post. The final report will be due on Friday, Feb. 25th.
In this assignment, you will explore the Spark GraphFrames library as well as implement your own Girvan-Newman algorithm using the Spark Framework to detect communities in graphs. You will use the ub_sample_data.csv dataset to find users who have a similar business taste. The goal of this assignment is to help you understand how to use the Girvan-Newman algorithm to detect communities in an efficient way within a distributed environment.

2.1 Programming Requirements
a. You must use Python and Spark to implement all tasks. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You can use the Spark DataFrame and GraphFrames library for task1, but for task2 you can ONLY use Spark RDD and standard Python or Scala libraries. (ps. For Scala, you can try GraphX, but for the assignment, you need to use GraphFrames.)

2.2 Programming Environment
Python 3.6, Scala 2.11 and Spark 2.3.2. We will use Vocareum to automatically run and grade your submission. You must test your scripts on the local machine and the Vocareum terminal before submission.

2.3 Write your own code
Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

2.4 What you need to turn in
You need to submit the following files on Vocareum: (all lowercase)
a. [REQUIRED] two Python scripts, named: task1.py, task2.py
b1. [REQUIRED FOR SCALA] two Scala scripts, named: task1.scala, task2.scala
b2. [REQUIRED FOR SCALA] one jar package, named: hw4.jar
c.
[OPTIONAL] You can include other scripts called by your main program
d. You don't need to include your results. We will grade your code with our testing data (data will be in the same format).

You will continue to use the Yelp dataset. We have generated a sub-dataset, ub_sample_data.csv, from the Yelp review dataset containing user_id and business_id. You can download it from Vocareum.

4.1 Graph Construction
To construct the social network graph, each node represents a user and there will be an edge between two nodes if the number of times that the two users reviewed the same business is greater than or equal to the filter threshold. For example, suppose user1 reviewed [business1, business2, business3] and user2 reviewed [business2, business3, business4, business5]. If the threshold is 2, there will be an edge between user1 and user2. If a user node has no edge, we will not include that node in the graph. In this assignment, we use filter threshold 7.

4.2 Task1: Community Detection Based on GraphFrames (2 pts)
In task1, you will explore the Spark GraphFrames library to detect communities in the network graph you constructed in 4.1. The library provides an implementation of the Label Propagation Algorithm (LPA), which was proposed by Raghavan, Albert, and Kumara in 2007. It is an iterative community detection solution whereby information "flows" through the graph based on the underlying edge structure. For the details of the algorithm, you can refer to the paper posted on Piazza. In this task, you do not need to implement the algorithm from scratch; you can call the method provided by the library. The following websites may help you get started with Spark GraphFrames:
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guidepython.html
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html

4.2.1 Execution Detail
The version of GraphFrames should be 0.6.0.
For Python:
• In PyCharm, you need to pip install graphframes and add the line below to your code:
os.environ["PYSPARK_SUBMIT_ARGS"] = ("--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11")
• In the terminal, you need to assign the parameter "packages" of spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

For Scala:
• In IntelliJ IDEA, you need to add library dependencies to your project:
"graphframes" % "graphframes" % "0.6.0-spark2.3-s_2.11"
"org.apache.spark" %% "spark-graphx" % sparkVersion
• In the terminal, you need to assign the parameter "packages" of spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

For the parameter "maxIter" of the LPA method, you should set it to 5.

4.2.2 Output Result
In this task, you need to save the result of your communities in a txt file. Each line represents one community and the format is: 'user_id1', 'user_id2', 'user_id3', 'user_id4', …
Your result should be sorted firstly by the size of the communities in ascending order and then by the first user_id in the community in lexicographical order (the user_id is of type string). The user_ids in each community should also be in lexicographical order. If there is only one node in the community, we still regard it as a valid community.

Figure 1: community output file format

4.3 Task2: Community Detection Based on the Girvan-Newman algorithm (6 pts)
In task2, you will implement your own Girvan-Newman algorithm to detect the communities in the network graph. Because your task1 and task2 code will be executed separately, you need to construct the graph again in this task following the rules in section 4.1. You can refer to Chapter 10 of the Mining of Massive Datasets book for the algorithm details. For task2, you can ONLY use Spark RDD and standard Python or Scala libraries. Remember to delete the code that imports graphframes.

4.3.1 Betweenness Calculation (3 pts)
In this part, you will calculate the betweenness of each edge in the original graph you constructed in 4.1. Then you need to save your result in a txt file. The format of each line is:
('user_id1', 'user_id2'), betweenness value
Your result should be sorted firstly by the betweenness values in descending order and then by the first user_id in the tuple in lexicographical order (the user_id is of type string). The two user_ids in each tuple should also be in lexicographical order. You do not need to round your result.

Figure 2: betweenness output file format

4.3.2 Community Detection (3 pts)
You are required to divide the graph into suitable communities, which reach the global highest modularity. The formula of modularity is shown below:
Q = (1/2m) * Σ_ij [A_ij − (k_i * k_j)/(2m)] * δ(c_i, c_j)
According to the Girvan-Newman algorithm, after removing one edge, you should re-compute the betweenness. The "m" in the formula represents the edge number of the original graph. The "A" in the formula is the adjacency matrix of the original graph. (Hint: in each removal step, "m" and "A" should not be changed.) If a community has only one user node, we still regard it as a valid community. You need to save your result in a txt file. The format is the same as the community output file from task1.

4.4 Execution Format
Execution example:
Python:
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 task1.py <filter threshold> <input file path> <community output file path>
spark-submit task2.py <filter threshold> <input file path> <betweenness output file path> <community output file path>
Scala:
spark-submit --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 --class task1 hw4.jar <filter threshold> <input file path> <community output file path>
spark-submit --class task2 hw4.jar <filter threshold> <input file path> <betweenness output file path> <community output file path>

Input parameters:
: the filter threshold to generate edges between user nodes. 2. : the path to the input file including path, file name and extension. 3. : the path to the betweenness output file including path, file name and extension. 4. : the path to the community output file including path, file name and extension.
Execution time: The overall runtime limit of your task1 (from reading the input file to finishing writing the community output file) is 200 seconds. The overall runtime limit of your task2 (from reading the input file to finishing writing the community output file) is 250 seconds. If your runtime exceeds the above limit, there will be no point for this task.
5. About Vocareum
a. You can use the provided datasets under the directory resource: /asnlib/publicdata/ b. You should upload the required files under your workspace: work/ c. You must test your scripts on both the local machine and the Vocareum terminal before submission. d. During the submission period, the Vocareum will automatically test task1 and task2. e. During the grading period, the Vocareum will use another dataset that has the same format for testing. f. We do not test the Scala implementation during the submission period. g. Vocareum will automatically run both Python and Scala implementations during the grading period. h. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade on your last submission.
6. Grading Criteria (% penalty = % penalty of possible points you get)
a. You can use your free 8-day extension separately or together. You must submit a late-day request via https://forms.gle/worKTbCRBWKQ6jSu6. This form records the number of late days you use for each assignment. By default, we will not count the late days if no request is submitted. b. There will be 10% bonus for each task if your Scala implementations are correct. Only when your Python results are correct, the bonus of Scala will be calculated. There is no partial point for Scala. c.
There will be no point if your submission cannot be executed on Vocareum. d. There is no regrading. Once the grade is posted on the Blackboard, we will only regrade your assignments if there is a grading error. No exceptions. e. There will be 20% penalty for the late submission within one week and no point after that.
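The edge-betweenness calculation described in section 4.3.1 is usually done with one BFS per node, assigning credits to edges bottom-up and finally halving the totals. The sketch below is an illustrative pure-Python outline (the function names and the dict-of-sets graph representation are ours, not part of the assignment); distributing the per-root work over Spark RDDs is left out:

```python
from collections import deque, defaultdict

def single_source_edge_credit(graph, root):
    """One BFS from `root`: count shortest paths, then assign edge credits
    bottom-up (the Girvan-Newman credit rule). `graph` maps each node to a
    set of neighbours; returns {frozenset({u, v}): credit}."""
    level = {root: 0}
    num_paths = defaultdict(int)   # number of shortest paths from root
    num_paths[root] = 1
    parents = defaultdict(list)    # shortest-path predecessors
    order = []                     # nodes in BFS order
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in level:
                level[nbr] = level[node] + 1
                queue.append(nbr)
            if level[nbr] == level[node] + 1:
                num_paths[nbr] += num_paths[node]
                parents[nbr].append(node)
    # Every node starts with credit 1 (the root's own credit is never used).
    node_credit = {n: 1.0 for n in order}
    edge_credit = {}
    for node in reversed(order):   # children are processed before parents
        for par in parents[node]:
            share = node_credit[node] * num_paths[par] / num_paths[node]
            edge_credit[frozenset((par, node))] = share
            node_credit[par] += share
    return edge_credit

def edge_betweenness(graph):
    """Betweenness of each edge: sum credits over all roots, then halve."""
    total = defaultdict(float)
    for root in graph:
        for edge, credit in single_source_edge_credit(graph, root).items():
            total[edge] += credit
    return {edge: credit / 2 for edge, credit in total.items()}
```

For the path graph a-b-c, both edges get betweenness 2.0 (each lies on two of the three node pairs' shortest paths).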
In this assignment, you will implement the SON algorithm using the Apache Spark Framework. You will develop a program to find frequent itemsets in two datasets, one simulated dataset and one real-world dataset generated from the Yelp dataset. The goal of this assignment is to apply the algorithms you have learned in class on large datasets more efficiently in a distributed environment.
2.1 Programming Requirements
a. You must use Python to implement all tasks. There will be 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. You are required to only use Spark RDD in order to understand Spark operations more deeply. You will not get any point if you use Spark DataFrame or DataSet.
2.2 Programming Environment
Python 3.6, Scala 2.11 and Spark 2.3.2 We will use Vocareum to automatically run and grade your submission. We highly recommend that you first test your script on your local machine and then submit to Vocareum.
2.3 Write your own code
Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.
2.4 What you need to turn in
a. Three Python scripts, named (all lowercase): task1.py, task2.py, preprocess.py b1. [OPTIONAL] two Scala scripts, named (all lowercase): task1.scala, task2.scala (No need to write preprocessing code in Scala) b2. [OPTIONAL] one jar package, named: hw2.jar (all lowercase) Note. You don't need to include your output files.
We will grade your code with our testing data (data will be in the same format).
In this assignment, you will use one simulated dataset and one real-world dataset. In task 1, you will build and test your program with a small simulated CSV file that has been provided to you. For task 2, you need to generate a subset using business.json and review.json from the Yelp dataset (https://drive.google.com/drive/folders/1-Y4H0vw2rRIjByDdGcsEuor9VagDyzin?usp=sharing) with the same structure as the simulated data. Figure 1 shows the file structure: the first column is user_id and the second column is business_id. In task2, you will test your code with this real-world data.
Figure 1: Input Data Format
We will only provide a submission report for small1.csv on Vocareum for task 1. No submission report will be provided for task2. You are encouraged to use the command line to run the code for small2.csv as well as for task2 to get a sense of the running time.
In this assignment, you will implement the SON algorithm to solve all tasks (Task 1 and 2) on top of the Apache Spark Framework. You need to find all the possible combinations of the frequent itemsets in any given input file within the required time. You can refer to Chapter 6 from the Mining of Massive Datasets book and concentrate on section 6.4 – Limited-Pass Algorithms. (Hint: you can choose either A-Priori, MultiHash, or PCY algorithm to process each chunk of the data)
4.1 Task 1: Simulated data (4 pts)
There are two CSV files (small1.csv and small2.csv) provided on the Vocareum in your workspace. The small1.csv is just a sample file that you can use to debug your code. For task1, we will test your code on small2.csv for grading. In this task, you need to build two kinds of market-basket models. Case 1 (2 pts): You will calculate the combinations of frequent businesses (as singletons, pairs, triples, etc.) that are qualified as frequent given a support threshold.
You need to create a basket for each user containing the business ids reviewed by this user. If a business was reviewed more than once by a reviewer, we consider this product was rated only once. More specifically, the business ids within each basket are unique. The generated baskets are similar to: user1: [business11, business12, business13, …] user2: [business21, business22, business23, …] user3: [business31, business32, business33, …]
Case 2 (2 pts): You will calculate the combinations of frequent users (as singletons, pairs, triples, etc.) that are qualified as frequent given a support threshold. You need to create a basket for each business containing the user ids that commented on this business. Similar to case 1, the user ids within each basket are unique. The generated baskets are similar to: business1: [user11, user12, user13, …] business2: [user21, user22, user23, …] business3: [user31, user32, user33, …]
Input format: 1. Case number: Integer that specifies the case. 1 for Case 1 and 2 for Case 2. 2. Support: Integer that defines the minimum count to qualify as a frequent itemset. 3. Input file path: This is the path to the input file including path, file name and extension. 4. Output file path: This is the path to the output file including path, file name and extension.
Output format: 1. Runtime: the total execution time from loading the file till finishing writing the output file. You need to print the runtime in the console with the "Duration" tag, e.g., "Duration: 100". 2. Output file: (1) Output-1: You should use "Candidates:" as the tag. For each line you should output the candidates of frequent itemsets you find after the first pass of the SON algorithm, followed by an empty line after each frequent-X itemset combination list. The printed itemsets must be sorted in lexicographical order. (Both user_id and business_id have the data type "string".) (2) Output-2: You should use "Frequent Itemsets:" as the tag.
For each line you should output the final frequent itemsets you found after finishing the SON algorithm. The format is the same as Output-1. The printed itemsets must be sorted in lexicographical order. Here is an example of the output file: Both the output-1 result and the output-2 result should be saved in ONE output result file "firstname_lastname_task1.txt".
Execution example:
Python: spark-submit firstname_lastname_task1.py
Scala: spark-submit --class firstname_lastname_task1 firstname_lastname_hw2.jar
4.2 Task 2: Yelp data (4 pts)
In task2, you will explore the Yelp dataset to find the frequent business sets (only case 1). You will jointly use the business.json and review.json to generate the input user-business CSV file yourselves.
(1) Data preprocessing
You need to generate a sample dataset from business.json and review.json (https://drive.google.com/drive/folders/1-Y4H0vw2rRIjByDdGcsEuor9VagDyzin?usp=sharing) with the following steps: 1. The state of the business you need is Nevada, i.e., filtering 'state' == 'NV'. 2. Select "user_id" and "business_id" from review.json whose "business_id" is from Nevada. Each line in the CSV file would be "user_id1, business_id1". 3. The header of the CSV file should be "user_id,business_id". You need to save the dataset in CSV format. Figure 2 shows an example of the output file.
Figure 2: user_business file
You need to submit the code and the output file of this data preprocessing step. The preprocessing code will NOT be graded. We will use different filters to generate another dataset for grading.
(2) Apply SON algorithm
The requirements for task 2 are similar to task 1. However, you will test your implementation with the large dataset you just generated. For this purpose, you need to report the total execution time. For this execution time, we also take into account the time from reading the file till writing the results to the output file. You are asked to find the frequent business sets (only case 1) from the file you just generated.
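The preprocessing steps above (filter businesses whose 'state' is 'NV', then select the matching user_id/business_id pairs from review.json) could be sketched as below; the function name and file paths are illustrative, and both inputs are assumed to be one JSON object per line, as in the Yelp dumps:

```python
import json

def build_user_business_csv(business_path, review_path, out_path):
    # Step 1: collect business_ids whose state is 'NV'.
    nv_ids = set()
    with open(business_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec.get('state') == 'NV':
                nv_ids.add(rec['business_id'])
    # Steps 2-3: write the "user_id,business_id" header, then one row
    # per review of a Nevada business.
    with open(review_path) as rf, open(out_path, 'w') as out:
        out.write('user_id,business_id\n')
        for line in rf:
            rec = json.loads(line)
            if rec['business_id'] in nv_ids:
                out.write('{},{}\n'.format(rec['user_id'], rec['business_id']))
```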
The following are the steps you need to do: 1. Read the user_business CSV file into an RDD and then build the case 1 market-basket model; 2. Find out qualified users who reviewed more than k businesses (k is the filter threshold); 3. Apply the SON algorithm code to the filtered market-basket model.
Input format: 1. Filter threshold: Integer that is used to filter out qualified users. 2. Support: Integer that defines the minimum count to qualify as a frequent itemset. 3. Input file path: This is the path to the input file including path, file name and extension. 4. Output file path: This is the path to the output file including path, file name and extension.
Output format: 1. Runtime: the total execution time from loading the file till finishing writing the output file. You need to print the runtime in the console with the "Duration" tag, e.g., "Duration: 100". 2. Output file: The output file format is the same as task 1. Both the intermediate results and final results should be saved in ONE output result file "firstname_lastname_task2.txt".
Execution example:
Python: spark-submit firstname_lastname_task2.py
Scala: spark-submit --class firstname_lastname_task2 firstname_lastname_hw2.jar
5. Evaluation Metric Task 1: Task 2:
6. Grading Criteria (% penalty = % penalty of possible points you get)
1. You can use your free 5-day extension separately or together, and you need to email your TA to indicate that you are using the free days within 24 hours of your submission. 2. There will be a 10% bonus for each task (i.e., 0.3 pts, 0.2 pts, 0.3 pts) if your Scala implementations are correct. Only when your Python results are correct, the bonus of using Scala will be calculated. There is no partial point for Scala. 3. There will be no point if your programs cannot be executed on Vocareum. Please start your assignment early! You can resubmit on Vocareum. We will grade your last submission. 4. There is no regrading.
Once the grade is posted on the Blackboard, we will only regrade your assignments if there is a grading error. No exceptions. 5. There will be 20% penalty for the late submission within a week and no point after a week. 6. There will be no point if the total execution time exceeds the limit in the Section 6 evaluation metric. 7. If the outputs of your program are unsorted or partially sorted, there will be 50% penalty.
Input File    Case    Support    Runtime (sec)
small2.csv    1       4
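As a reference point for the hint about processing each chunk, a plain A-Priori pass over one chunk of baskets might look like the sketch below. This is illustrative only: the surrounding SON two-pass logic (mapPartitions to collect candidates, a second full pass to count them globally) is not shown, and the function name is ours:

```python
from itertools import combinations
from collections import defaultdict

def apriori_chunk(baskets, support):
    """A-Priori over one chunk: returns every itemset (as a frozenset)
    whose count within this chunk is at least `support`."""
    baskets = [frozenset(b) for b in baskets]
    # Pass over singletons first.
    counts = defaultdict(int)
    for b in baskets:
        for item in b:
            counts[frozenset([item])] += 1
    frequent = {s for s, c in counts.items() if c >= support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidate k-sets: unions of two frequent (k-1)-sets that have
        # size k and whose every (k-1)-subset is frequent (monotonicity).
        flist = list(frequent)
        candidates = set()
        for i in range(len(flist)):
            for j in range(i + 1, len(flist)):
                union = flist[i] | flist[j]
                if len(union) == k and all(
                        frozenset(sub) in frequent
                        for sub in combinations(union, k - 1)):
                    candidates.add(union)
        counts = defaultdict(int)
        for b in baskets:
            for cand in candidates:
                if cand <= b:
                    counts[cand] += 1
        frequent = {s for s, c in counts.items() if c >= support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

With baskets {a,b,c}, {a,b}, {a,c}, {b,c}, {a,b,c} and support 3, this returns the three singletons and the three pairs, but not {a,b,c} (which appears only twice).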
In this assignment, you are going to implement three streaming algorithms. In the first two tasks, you will generate a simulated data stream with the Yelp dataset and implement the Bloom Filtering and Flajolet-Martin algorithms. In the third task, you will do some analysis using Fixed Size Sample (Reservoir Sampling).
2.1 Programming Requirements
a. You must use Python and Spark to implement all tasks. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. You are not required to use Spark RDD in this assignment. c. You can only use standard Python libraries, which are already installed on Vocareum.
2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use the above library versions to compile and test your codes. You are required to make sure your codes work and run on Vocareum, otherwise we won't be able to grade your code.
2.3 Important things before starting the assignment:
1. If we cannot call myhashs(s) in task1 and task2 in your script to get the hash value list, there will be a 50% penalty. 2. We will simulate your bloom filter in the grading program simultaneously based on your myhashs(s) outputs. There will be no point if the reported output is largely different from our simulation. 3. Please use integer 553 as the random seed for task 3, and follow the steps mentioned below to get a random number. If you use the wrong random seed, or discard any obtained random number, or the sequence of random numbers is different from our simulation, there will be a 50% penalty.
2.4 Write your own code
Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code!
TAs will combine all the codes we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.
For this assignment, you need to use users.txt as the input file. You also need a Python blackbox file to generate data from the input file. Both users.txt and blackbox.py can be found in the publicdata directory on Vocareum. We use the blackbox as a simulation of a data stream. The blackbox will return a list of user ids from the file users.txt every time we call it. Although it is very unlikely that the user ids returned from the blackbox are not unique, you are required to handle this case wherever necessary. Please call the blackbox function like the example in the following figure: If you need to ask the blackbox multiple times, you can do it by the following sample code:
4.1 Task1: Bloom Filtering (2.5 pts)
You will implement the Bloom Filtering algorithm to estimate whether a user_id in the data stream has been seen before. The details of the Bloom Filtering algorithm can be found on the streaming lecture slide. Please find proper hash functions and the number of hash functions in the Bloom Filtering algorithm. In this task, you should keep a global filter bit array whose length is 69997. The hash functions used in a Bloom filter should be independent and uniformly distributed. Some possible hash functions are: f(x) = (ax + b) % m or f(x) = ((ax + b) % p) % m, where p is any prime number and m is the length of the filter bit array. You can use any combination for the parameters (a, b, p). The hash functions should remain the same once you create them. As the user_id is a string, you need to convert the type of user_id to an integer and then apply hash functions to it.
The following code shows one possible solution to converting the user_id string to an integer: import binascii; int(binascii.hexlify(s.encode('utf8')), 16) (We only treat the exact same strings as the same users. You do not need to consider aliases.)
Execution Details
To calculate the false positive rate (FPR), you need to maintain a set of previously seen users. The size of a single data stream will be 100 (stream_size). We will test your code more than 30 times (num_of_asks), and your FPRs are only allowed to be larger than 0.5 at most once. The run time should be within 100s for 30 data streams.
Output Results
You need to save your results in a CSV file with the header "Time,FPR". Each line stores the index of the data batch (starting from 0) and the false positive rate for that batch of data. You do not need to round your answer. You also need to encapsulate your hash functions into a function called myhashs. The input of the myhashs function is a user_id (string) and the output is a list of hash values. For example, if you have three hash functions, the size of the output list should be three and each element in the list corresponds to an output value of your hash function. The figure below is a template of the myhashs function: Our grading program will also import your Python script, call the myhashs function to test the performance of your hash functions, and track your implementation.
4.2 Task2: Flajolet-Martin algorithm (2.5 pts)
In task2, you will implement the Flajolet-Martin algorithm (including the step of combining estimations from groups of hash functions) to estimate the number of unique users within a window in the data stream. The details of the Flajolet-Martin algorithm can be found on the streaming lecture slide. You need to find proper hash functions and the number of hash functions in the Flajolet-Martin algorithm.
Execution Details
For this task, the size of the stream will be 300 (stream_size). We will test your code more than 30 times (num_of_asks).
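Since the myhashs template figure is not reproduced here, a minimal sketch of the task 1 function following the f(x) = ((ax + b) % p) % m form might look as follows. The (a, b) pairs, the prime p, and the number of hash functions are arbitrary example choices, not required values; only the bit-array length 69997 comes from the spec:

```python
import binascii

# Illustrative parameters: three (a, b) pairs and a large prime p.
PARAMS = [(387, 91), (1021, 577), (5003, 247)]
P = 1000000007
M = 69997  # length of the global filter bit array, from the spec

def myhashs(s):
    """Return one hash value per hash function for a user_id string,
    applying f(x) = ((a*x + b) % p) % m to the integer form of the id."""
    x = int(binascii.hexlify(s.encode('utf8')), 16)
    return [((a * x + b) % P) % M for a, b in PARAMS]
```

The grading program imports this function directly, so it must be defined at module level and must stay deterministic across calls.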
And for your final result, 0.2
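A minimal sketch of the Flajolet-Martin estimate for a single hash function is shown below (track the longest run of trailing zero bits R over the window, then estimate 2^R). The required step of combining estimates from groups of hash functions (e.g., averaging within groups and taking the median across groups) is omitted, and the function names are ours:

```python
def trailing_zeros(x):
    """Count the trailing zero bits of x; define the count as 0 for x == 0."""
    if x == 0:
        return 0
    count = 0
    while x % 2 == 0:
        x //= 2
        count += 1
    return count

def fm_estimate(hash_values):
    """Flajolet-Martin estimate for ONE hash function over a window:
    2 ** R, where R is the maximum trailing-zero count observed."""
    r = max(trailing_zeros(v) for v in hash_values)
    return 2 ** r
```

For example, over hash values 8 (0b1000), 2 (0b10), and 1, the longest trailing-zero run is 3, so the estimate is 2**3 = 8.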
In this assignment, you will explore the Spark GraphFrames library as well as implement your own Girvan-Newman algorithm using the Spark Framework to detect communities in graphs. You will use the ub_sample_data.csv dataset to find users who have similar business tastes. The goal of this assignment is to help you understand how to use the Girvan-Newman algorithm to detect communities in an efficient way within a distributed environment.
2. Requirements
2.1 Programming Requirements
a. For Task 1, you can use the Spark DataFrame and GraphFrames library. For Task 2, you can ONLY use Spark RDD and standard Python or Scala libraries. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct.
2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use these library versions to compile and test your code. There will be no point if we cannot run your code on Vocareum.
2.3 Write your own code
Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.
2.4 What you need to turn in
You need to submit the following files on Vocareum: a. [REQUIRED] two Python scripts, named: task1.py, task2.py b1. [OPTIONAL, REQUIRED FOR SCALA] two Scala scripts, named: task1.scala, task2.scala b2. [OPTIONAL, REQUIRED FOR SCALA] one jar package, named: hw4.jar c. [OPTIONAL] You can include other scripts called by your main program. d. You don't need to include your results.
We will grade your code with our testing data (data will be in the same format). We have generated a sub-dataset, ub_sample_data.csv, from the Yelp review dataset containing user_id and business_id. You can find the data on Vocareum under resource/asnlib/publicdata/.
4.1 Graph Construction
To construct the social network graph, assume that each node is uniquely labeled and that links are undirected and unweighted. Each node represents a user. There should be an edge between two nodes if the number of common businesses reviewed by the two users is greater than or equal to the filter threshold. For example, suppose user1 reviewed the set {business1, business2, business3} and user2 reviewed the set {business2, business3, business4, business5}. If the threshold is 2, there will be an edge between user1 and user2. If a user node has no edge, we will not include that node in the graph. The filter threshold will be given as an input parameter when running your code.
4.2 Task1: Community Detection Based on GraphFrames (2 pts)
In task1, you will explore the Spark GraphFrames library to detect communities in the network graph you constructed in 4.1. The library provides an implementation of the Label Propagation Algorithm (LPA), which was proposed by Raghavan, Albert, and Kumara in 2007. It is an iterative community detection solution whereby information "flows" through the graph based on the underlying edge structure. In this task, you do not need to implement the algorithm from scratch; you can call the method provided by the library. The following websites may help you get started with Spark GraphFrames: https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html
4.2.1 Execution Detail
The version of GraphFrames should be 0.6.0. (For your convenience, graphframes 0.6.0 is already installed for Python on Vocareum.
The corresponding jar package can also be found under the $ASNLIB/public folder.)
For Python (in local machine):
● [Approach 1] Run "python3.6 -m pip install graphframes" in the terminal to install the package.
● [Approach 2] In PyCharm, you add the sentence below into your code to use the jar package: os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages graphframes:graphframes:0.8.2-spark3.1-s_2.12 pyspark-shell"
● In the terminal, you need to assign the parameter "packages" of the spark-submit: --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
For Scala (in local machine):
● In Intellij IDEA, you need to add library dependencies to your project: "graphframes" % "graphframes" % "0.8.2-spark3.1-s_2.12" "org.apache.spark" %% "spark-graphx" % sparkVersion
● In the terminal, you need to assign the parameter "packages" of the spark-submit: --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
For the parameter "maxIter" of the LPA method, you should set it to 5.
4.2.2 Output Result
In this task, you need to save your result of communities in a txt file. Each line represents one community and the format is: 'user_id1', 'user_id2', 'user_id3', 'user_id4', … Your result should be firstly sorted by the size of communities in ascending order, and then the first user_id in the community in lexicographical order (the user_id is of type string). The user_ids in each community should also be in lexicographical order. If there is only one node in the community, we still regard it as a valid community.
Figure 1: community output file format
4.3 Task 2: Community Detection Based on Girvan-Newman algorithm (5 pts)
In task 2, you will implement your own Girvan-Newman algorithm to detect the communities in the network graph. You can refer to Chapter 10 from the Mining of Massive Datasets book for the algorithm details. Because your task1 and task2 code will be executed separately, you need to construct the graph again in this task following the rules in section 4.1.
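The graph-construction rule from section 4.1 (an edge between two users who share at least the filter threshold of reviewed businesses, with edgeless nodes dropped) can be sketched in plain Python as below; the names are illustrative and the real rows would come from ub_sample_data.csv:

```python
from itertools import combinations
from collections import defaultdict

def build_graph(rows, filter_threshold):
    """rows: iterable of (user_id, business_id) pairs.
    Returns (nodes, edges), where an undirected edge (u, v) exists when
    the two users share >= filter_threshold reviewed businesses; users
    without any edge are excluded, per section 4.1."""
    baskets = defaultdict(set)
    for user, business in rows:
        baskets[user].add(business)
    nodes, edges = set(), set()
    for u, v in combinations(sorted(baskets), 2):
        if len(baskets[u] & baskets[v]) >= filter_threshold:
            edges.add((u, v))
            nodes.update((u, v))
    return nodes, edges
```

Using the example from 4.1, user1 with {business1, business2, business3} and user2 with {business2, business3, business4, business5} share two businesses, so with threshold 2 they get an edge, while a user sharing nothing is left out of the node set.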
For task 2, you can ONLY use Spark RDD and standard Python or Scala libraries. Remember to delete your code that imports graphframes. Usage of Spark DataFrame is NOT allowed in this task.
4.3.1 Betweenness Calculation (2 pts)
In this part, you will calculate the betweenness of each edge in the original graph you constructed in 4.1. Then you need to save your result in a txt file. The format of each line is: ('user_id1', 'user_id2'), betweenness value. Your result should be firstly sorted by the betweenness values in descending order and then the first user_id in the tuple in lexicographical order (the user_id is of type string). The two user_ids in each tuple should also be in lexicographical order. For output, you should use the Python built-in round() function to round the betweenness value to five digits after the decimal point. (Rounding is for output only; please do not use the rounded numbers for further calculation.) IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We will not regrade because of formatting issues.
Figure 2: betweenness output file format
4.3.2 Community Detection (3 pts)
You are required to divide the graph into suitable communities, which reaches the global highest modularity. The formula of modularity is shown below. According to the Girvan-Newman algorithm, after removing one edge, you should re-compute the betweenness. The "m" in the formula represents the edge number of the original graph. (Hint: In each remove step, "m", "k_i" and "k_j" should not be changed, while "A" is calculated based on the updated graph.) In the step of removing the edges with the highest betweenness, if two or more edges have the same (highest) betweenness, you should remove all those edges. If the community only has one user node, we still regard it as a valid community. You need to save your result in a txt file. The format is the same as the output file from task 1.
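The modularity formula referred to in 4.3.2 is not reproduced above (it appeared as a figure); the standard Girvan-Newman form, consistent with the m, A, k_i, and k_j terms described, is:

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Here m is the number of edges in the original graph, A_ij is the adjacency-matrix entry for nodes i and j (recomputed on the updated graph, per the hint), k_i and k_j are the degrees of nodes i and j in the original graph, and δ(c_i, c_j) is 1 when i and j belong to the same community and 0 otherwise.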
(Hint: For the second part of task 2, you should take the precision into account, e.g., stop the modularity calculation only if there is a significant reduction in the new modularity.)
4.4 Execution Format
Execution example:
Python:
spark-submit --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12 task1.py
spark-submit task2.py
Scala:
spark-submit --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12 --class task1 hw4.jar
spark-submit --class task2 hw4.jar
Input parameters: 1. : the filter threshold to generate edges between user nodes. 2. : the path to the input file including path, file name and extension. 3. : the path to the betweenness output file including path, file name and extension. 4. : the path to the community output file including path, file name and extension.
Execution time: The overall runtime limit of your task1 (from reading the input file to finishing writing the community output file) is 400 seconds. The overall runtime limit of your task 2 (from reading the input file to finishing writing the community output file) is 400 seconds. If your runtime exceeds the above limit, there will be no point for this task.
5. About Vocareum
a. The dataset is under the directory $ASNLIB/publicdata/, and the jar package is under $ASNLIB/public/. b. You should upload the required files under your workspace: work/, and click submit. c. You should test your scripts on both the local machine and the Vocareum terminal before submission. d. During the submission period, the Vocareum will automatically test task1 and task2. e. During the grading period, the Vocareum will use another dataset that has the same format for testing. f. We do not test the Scala implementation during the submission period. g. Vocareum will automatically run both Python and Scala implementations during the grading period. h. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade on your last submission.
6. Grading Criteria (% penalty = % penalty of possible points you get)
1. You can use your free 5-day extension separately or together. a. https://forms.gle/edH8jw1mJjrLFRcm8 b. This form will record the number of late days you use for each assignment. We will not count late days if no request is submitted. Remember to submit the request BEFORE the deadline. 2. There will be a 10% bonus if you use both Scala and Python. 3. We will combine all the code we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. 4. All submissions will be graded on the Vocareum. Please strictly follow the format provided, otherwise you can't get the point even though the answer is correct. 5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty. 6. We can regrade your assignments within seven days once the scores are released. No argument after one week. 7. There will be a 20% penalty for late submission within a week and no point after a week. 8. Only when your results from Python are correct, the bonus of using Scala will be calculated. There is no partial point for Scala.
7. Common problems causing failed submissions on Vocareum/FAQ (If your program seems to run successfully on your local machine but fails on Vocareum, please check these.)
1. Try your program on the Vocareum terminal. Remember to set the Python version to python3.6, use the latest Spark /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit, and select JDK 8 by running the command "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64". 2. Check the input command line formats. 3. Check the output formats, for example, the headers, tags, typos. 4. Check the requirements of sorting the results. 5. Your program scripts should be named as task1.py, task2.py, etc. 6. Check whether your local environment fits the assignment description, i.e. version, configuration. 7.
If you implement the core part in Python instead of Spark, or implement it with a high time complexity (e.g. searching for an element in a list instead of a set), your program may be killed on the Vocareum because it runs too slowly. 8. You are required to only use Spark RDD in order to understand Spark operations more deeply. You will not get any points if you use Spark DataFrame or DataSet. Don't import sparksql. 9. Do not use Vocareum for debugging purposes; please debug on your local machine. Vocareum can be very slow if you use it for debugging. 10. Vocareum is reliable in helping you to check the input and output formats, but its ability to check code correctness is limited. It cannot guarantee the correctness of the code even with a full score in the submission report. 11. Some students encounter an error like: the output rate …. has exceeded the allowed value ….bytes/s; attempting to kill the process. To resolve this, please remove all print statements and set the Spark logging level such that it limits the logs generated; that can be done using sc.setLogLevel(). Preferably, set the log level to either WARN or ERROR when submitting your code.
In this assignment, you will implement the SON Algorithm using the Spark Framework. You will develop a program to find frequent itemsets in two datasets, one simulated dataset and one generated real-world dataset. The goal of this assignment is to apply the algorithms you have learned in class on large datasets more efficiently in a distributed environment.
2.1 Programming Requirements
a. You must use Python to implement all tasks. You can only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. You are required to only use Spark RDD in order to understand Spark operations. You will not get any points if you use Spark DataFrame or DataSet. c. Python standard library list: https://docs.python.org/3/library/
2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use these library versions to compile and test your code. There will be no point if we cannot run your code on Vocareum. On Vocareum, you can call `spark-submit` located at `/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit`. (Do not use the one at /usr/local/bin/spark-submit.) We use `--executor-memory 4G --driver-memory 4G` on Vocareum for grading.
2.3 Write your own code
Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.
We will report all detected plagiarism, and severe penalties will be given to students whose submissions are plagiarized.

2.4 What you need to turn in
We will grade all submissions on Vocareum; submissions on Blackboard will be ignored. Vocareum produces a submission report after you click the "Submit" button (it takes a while, since Vocareum needs to run your code in order to generate the report). Vocareum will only grade Python scripts during the submission phase; it will grade both Python and Scala during the grading phase.
a. Two Python scripts, named (all lowercase): task1.py, task2.py
b. [OPTIONAL] The output jar file and two Scala scripts, named (all lowercase): hw2.jar, task1.scala, task2.scala
c. You don't need to include your results or the datasets. We will grade your code with our own testing data (in the same format).
d. Students can submit an unlimited number of times. Only the latest submission will be accepted and graded.

In this assignment, you will use one simulated dataset and one real-world dataset. In task 1, you will build and test your program with a small simulated CSV file that has been provided to you. Then, in task 2, you will generate a subset of the Ta Feng dataset with a structure similar to the simulated data. Figure 1 shows the structure of the simulated CSV for task 1: the first column is user_id and the second column is business_id.

Figure 1: Input Data Format

In this assignment, you will implement the SON Algorithm to solve both tasks (Task 1 and 2) on top of the Spark Framework. You need to find all possible combinations of frequent itemsets in any given input file within the required time. You can refer to Chapter 6 of the Mining of Massive Datasets book, concentrating on Section 6.4, Limited-Pass Algorithms.
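The two passes of SON can be sketched in pure Python before moving to Spark. This is a local stand-in for the mapPartitions/reduceByKey pipeline you are asked to build, using an A-Priori-style level-wise count on each chunk; the function names and chunking scheme are illustrative, not part of the assignment.

```python
from itertools import combinations

def local_candidates(chunk, local_support, max_size=3):
    """Pass 1 (per chunk): itemsets frequent within this chunk,
    found level by level with A-Priori pruning."""
    candidates = set()
    counts = {}
    for basket in chunk:                      # baskets hold unique items
        for item in basket:
            counts[item] = counts.get(item, 0) + 1
    frequent = {(i,) for i, c in counts.items() if c >= local_support}
    candidates |= frequent
    size = 2
    while frequent and size <= max_size:
        counts = {}
        for basket in chunk:
            items = sorted(b for b in basket if (b,) in candidates)
            for combo in combinations(items, size):
                # prune: every (size-1)-subset must already be frequent
                if all(sub in candidates for sub in combinations(combo, size - 1)):
                    counts[combo] = counts.get(combo, 0) + 1
        frequent = {c for c, n in counts.items() if n >= local_support}
        candidates |= frequent
        size += 1
    return candidates

def son(baskets, support, n_chunks=2):
    """Two passes: union of per-chunk candidates, then a global count."""
    chunk_len = -(-len(baskets) // n_chunks)          # ceiling division
    chunks = [baskets[i:i + chunk_len] for i in range(0, len(baskets), chunk_len)]
    all_candidates = set()
    for chunk in chunks:
        # scale the support threshold down in proportion to the chunk size
        local_support = max(1, support * len(chunk) // len(baskets))
        all_candidates |= local_candidates(chunk, local_support)
    # Pass 2: count every candidate over the full data, keep the frequent ones
    counts = {c: 0 for c in all_candidates}
    for basket in baskets:
        s = set(basket)
        for cand in all_candidates:
            if s.issuperset(cand):
                counts[cand] += 1
    return sorted(c for c, n in counts.items() if n >= support)
```

In the real task, `local_candidates` becomes the function you pass to `mapPartitions`, and the second pass becomes a count over the whole RDD.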
(Hint: you can choose the A-Priori, MultiHash, or PCY algorithm to process each chunk of the data.)

4.1 Task 1: Simulated data (3 pts)
There are two CSV files (small1.csv and small2.csv) on Vocareum under '../resource/asnlib/publicdata'. small1.csv is a test file that you can use to debug your code. For task 1, we will only test your code on small2.csv. In this task, you need to build two kinds of market-basket models.

Case 1 (1.5 pts): You will calculate the combinations of frequent businesses (as singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. You need to create a basket for each user containing the business ids reviewed by that user. If a business was reviewed more than once by a reviewer, we count it only once; more specifically, the business ids within each basket are unique. The generated baskets are similar to:
user1: [business11, business12, business13, …]
user2: [business21, business22, business23, …]
user3: [business31, business32, business33, …]

Case 2 (1.5 pts): You will calculate the combinations of frequent users (as singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. You need to create a basket for each business containing the user ids that commented on that business. As in case 1, the user ids within each basket are unique. The generated baskets are similar to:
business1: [user11, user12, user13, …]
business2: [user21, user22, user23, …]
business3: [user31, user32, user33, …]

Input format:
1. Case number: Integer that specifies the case. 1 for Case 1 and 2 for Case 2.
2. Support: Integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: The path to the input file, including path, file name, and extension.
4. Output file path: The path to the output file, including path, file name, and extension.

Output format:
1. Runtime: the total execution time from loading the file till finishing writing the output file. You need to print the runtime in the console with the "Duration" tag, e.g., "Duration: 100".
2. Output file:
(1) Intermediate result. You should use "Candidates:" as the tag. On each line you should output the candidates of frequent itemsets found after the first pass of the SON Algorithm, followed by an empty line after each combination. The printed itemsets must be sorted in lexicographical order (both user_id and business_id are strings).
(2) Final result. You should use "Frequent Itemsets:" as the tag. On each line you should output the final frequent itemsets found after finishing the SON Algorithm. The format is the same as the intermediate results, and the printed itemsets must again be sorted in lexicographical order.
Here is an example of the output file. Both the intermediate results and final results should be saved in ONE output result file.

Command line format:
Python: spark-submit task1.py
Scala: spark-submit --class task1 hw2.jar
Command line example:
/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --executor-memory 4G --driver-memory 4G task1.py 1 4 ../resource/asnlib/publicdata/small1.csv task1_output.txt

4.2 Task 2: Ta Feng data (4 pts)
In task 2, you will explore the Ta Feng dataset to find frequent itemsets (case 1 only). You will use the data found on Kaggle (https://bit.ly/2miWqFS) to find the product IDs associated with a given customer ID each day. Aggregate all purchases a customer makes within a day into one basket; in other words, assume a customer purchases all items bought within a day at once. The data file is provided at ../resource/asnlib/publicdata/ta_feng_all_months_merged.csv.

Note: Be careful when reading the CSV file, as Spark can read the product id numbers with leading zeros. You can manually format column F (PRODUCT_ID) to numbers (with zero decimal places) in the CSV file before reading it with Spark.
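The output-file layout described above (itemsets grouped by size, lexicographically sorted, sizes separated by an empty line) can be produced with a small helper. The exact `('a'),('b')` rendering used below is an assumption; match it against the example output file given in the assignment.

```python
def format_itemsets(itemsets):
    """Render itemsets grouped by size: one comma-separated line per
    size, groups separated by an empty line, itemsets sorted
    lexicographically as tuples of strings."""
    by_size = {}
    for itemset in sorted(itemsets):
        by_size.setdefault(len(itemset), []).append(itemset)
    lines = []
    for size in sorted(by_size):
        if size == 1:
            # singletons: avoid Python's trailing comma in str(('a',))
            line = ",".join("('%s')" % items[0] for items in by_size[size])
        else:
            line = ",".join(str(tuple(items)) for items in by_size[size])
        lines.append(line)
    return "\n\n".join(lines)

def write_output(path, candidates, frequent):
    """Both sections go into ONE result file, each under its tag."""
    with open(path, "w") as f:
        f.write("Candidates:\n" + format_itemsets(candidates) + "\n\n")
        f.write("Frequent Itemsets:\n" + format_itemsets(frequent) + "\n")
```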
SON Algorithm on the Ta Feng data: You will create a data pipeline where the input is the raw Ta Feng data and the output is the file described under "Output format". You will pre-process the data, and then from this pre-processed data you will create the final output. Your code is allowed to output the pre-processed data during execution, but you should NOT submit homework that includes it.

(1) Data preprocessing
You need to generate a dataset from the Ta Feng dataset with the following steps:
1. Find the date of the purchase (column TRANSACTION_DT), such as December 1, 2000 (12/1/00).
2. For each date, select CUSTOMER_ID and PRODUCT_ID.
3. We want to consider all items bought by a consumer on a given day as a separate transaction (i.e., "basket"). For example, if consumers 1, 2, and 3 each bought oranges on December 2, 2000, and consumer 2 also bought celery on December 3, 2000, we would consider that to be 4 separate transactions. An easy way to do this is to rename each CUSTOMER_ID as "DATE-CUSTOMER_ID". For example, if the CUSTOMER_ID is 12321 and this customer bought apples on November 14, 2000, then their new ID is "11/14/00-12321".
4. Make sure each line in the CSV file is "DATE-CUSTOMER_ID1, PRODUCT_ID1".
5. The header of the CSV file should be "DATE-CUSTOMER_ID, PRODUCT_ID".
You need to save the dataset in CSV format. The figure below shows an example of the output file (note that DATE-CUSTOMER_ID and PRODUCT_ID are strings and integers, respectively).

Figure: customer_product file

Do NOT submit the output file of this data preprocessing step, but your code is allowed to create this file.

(2) Apply the SON Algorithm
The requirements for task 2 are similar to task 1; however, you will test your implementation with the large dataset you just generated. For this purpose, you need to report the total execution time, measured from reading the file till writing the results to the output file.
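The preprocessing steps above can be sketched with the standard csv module. Column names match the Ta Feng header; whether the year must be shortened to two digits (as in the "11/14/00-12321" example) should be checked against the example figure.

```python
import csv

def preprocess(in_path, out_path):
    """Step (1): build the "DATE-CUSTOMER_ID, PRODUCT_ID" intermediate CSV."""
    with open(in_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.writer(fout)
        writer.writerow(["DATE-CUSTOMER_ID", "PRODUCT_ID"])
        for row in reader:
            date = row["TRANSACTION_DT"]            # e.g. "11/14/2000"
            parts = date.split("/")
            if len(parts) == 3 and len(parts[2]) == 4:
                # shorten to the "11/14/00" style used in the example
                date = "%s/%s/%s" % (parts[0], parts[1], parts[2][-2:])
            product = int(row["PRODUCT_ID"])        # int() drops leading zeros
            writer.writerow(["%s-%s" % (date, row["CUSTOMER_ID"]), product])
```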
You are asked to find the candidate and frequent itemsets (similar to the previous task) using the file you just generated. The steps are:
1. Read the customer_product CSV file into an RDD and build the case 1 market-basket model;
2. Find the qualified customer-dates who purchased more than k items (k is the filter threshold);
3. Apply your SON Algorithm code to the filtered market-basket model.

Input format:
1. Filter threshold: Integer used to filter out qualified users.
2. Support: Integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: The path to the input file, including path, file name, and extension.
4. Output file path: The path to the output file, including path, file name, and extension.

Output format:
1. Runtime: the total execution time from loading the file till finishing writing the output file. You need to print the runtime in the console with the "Duration" tag, e.g., "Duration: 100".
2. Output file: The output file format is the same as task 1. Both the intermediate results and final results should be saved in ONE output result file.

Command line format:
Python: spark-submit task2.py
Scala: spark-submit --class task2 hw2.jar
Command line example:
/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --executor-memory 4G --driver-memory 4G task2.py 20 50 ../resource/asnlib/publicdata/ta_feng_all_months_merged.csv task2_output.txt

6. Evaluation Metric
Task 1:
Input File | Case | Support | Runtime (sec)
small1.csv | 1 | 4
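Step 2 of the task 2 pipeline (keeping only customer-date baskets with more than k distinct items) reduces, in pure Python, to grouping and a length filter; in the real task this would be a groupByKey plus filter on the RDD.

```python
def build_and_filter(rows, k):
    """rows: (date_customer_id, product_id) pairs from the preprocessed CSV.
    Returns only the baskets containing more than k distinct products,
    where k is the filter threshold."""
    baskets = {}
    for key, product in rows:
        baskets.setdefault(key, set()).add(product)   # sets deduplicate
    return {key: items for key, items in baskets.items() if len(items) > k}
```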
In assignment 1, you will work on three tasks. The goal of these tasks is to get you familiar with Spark operation types (e.g., transformations and actions) and to explore a real-world dataset: the Yelp dataset (https://www.yelp.com/dataset). If you have questions about the assignment, please ask on Piazza; this helps promote interaction among students and will also serve as an FAQ for other students facing similar problems. You have to submit your assignments on Vocareum directly.

2.1 Programming Requirements
a. You must use Python to implement all tasks. You can only use the Python standard library (i.e., external libraries like numpy or pandas are not allowed), because that is sufficient for this programming assignment. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You are required to use only Spark RDD in order to understand Spark operations. You will not get any points if you use Spark DataFrame or DataSet.
c. Python standard library: https://docs.python.org/3/library/

2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2. We will use these versions to compile and test your code. There will be no points granted if we cannot run your code on Vocareum. On Vocareum, you can call `spark-submit` located at `/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit` (do not use the one at /usr/local/bin/spark-submit, which is version 2.3.0). We use `--executor-memory 4G --driver-memory 4G` on Vocareum for grading.

2.3 Write your own code
Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code!
TAs will combine all the code that can be found on the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism, and severe penalties will be given to students whose submissions are plagiarized.

2.4 What you need to turn in
We will grade all submissions on Vocareum. Vocareum produces a submission report after you click the "Submit" button (it takes a while, since Vocareum needs to run your code in order to generate the report). Vocareum will only grade Python scripts during the submission phase; it will grade both Python and Scala during the grading phase.
a. [REQUIRED] Three Python scripts, named (all lowercase): task1.py, task2.py, task3.py
b. [OPTIONAL, REQUIRED FOR SCALA] Three Scala scripts and the output jar file, named (all lowercase): hw1.jar, task1.scala, task2.scala, task3.scala
c. You don't need to include your results or the datasets. We will grade your code with our own testing data (in the same format).
d. Students can submit an unlimited number of times. Only the latest submission will be accepted and graded.

In this assignment, you will explore the Yelp dataset. You can find the data on Vocareum under resource/asnlib/publicdata/. The two files business.json and test_review.json are the files you will work on for this assignment; they are subsets of the original Yelp dataset. The submission report you get from Vocareum is for these subsets. For grading, we will use files from the original Yelp dataset, which is SIGNIFICANTLY larger (e.g., review.json can be 5GB). You should make sure your code works well on large datasets as well.

4.1 Task1: Data Exploration (3 points)
You will work on test_review.json, which contains the review information from users, and write a program to automatically answer the following questions:
A. The total number of reviews (0.5 point)
B. The number of reviews in 2018 (0.5 point)
C.
The number of distinct users who wrote reviews (0.5 point)
D. The top 10 users who wrote the largest numbers of reviews, and the number of reviews they wrote (0.5 point)
E. The number of distinct businesses that have been reviewed (0.5 point)
F. The top 10 businesses that had the largest numbers of reviews, and the number of reviews they had (0.5 point)

Input format: (we will use the following command to execute your code)
Python: /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --executor-memory 4G --driver-memory 4G task1.py
Scala: spark-submit --class task1 --executor-memory 4G --driver-memory 4G hw1.jar

Output format:
IMPORTANT: Please strictly follow the output format, since your code will be graded automatically.
a. The output for Questions A/B/C/E will be a number. The output for Questions D/F will be a list, sorted by the number of reviews in descending order. If two user_ids/business_ids have the same number of reviews, sort the user_ids/business_ids in lexicographical order.
b. You need to write the results in a JSON-format file. You must use exactly the same tags (see the red boxes in Figure 2) for answering each question.

Figure 1: JSON output structure for task1

4.2 Task2: Partition (2 points)
Since processing large volumes of data requires performance optimizations, properly partitioning the data for processing is imperative. In this task, you will show the number of partitions for the RDD used for Task 1 Question F and the number of items per partition. Then you need to use a customized partition function to improve the performance of the map and reduce tasks. A comparison of the time duration (for executing Task 1 Question F) between the system default partitioning and your customized partitioning (the RDD built using your partition function) should also be shown in your results.

Hint: Certain operations within Spark trigger an event known as the shuffle.
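A customized partition function can be as simple as hashing the key. The pure-Python sketch below mimics what `rdd.partitionBy(numPartitions, partitionFunc)` does with a user-supplied function; the helper names are illustrative, not part of the assignment.

```python
def partitioner(key):
    """Partition function to pass as partitionFunc: map a key
    (e.g. a business_id) to an int. partitionBy takes this value
    modulo the partition count, so equal keys always land in the
    same partition and the later reduce avoids a full shuffle."""
    return hash(key)

def items_per_partition(keys, n_partitions):
    """Stand-in for rdd.glom().map(len).collect(): count how many
    items fall into each partition under the function above."""
    sizes = [0] * n_partitions
    for key in keys:
        sizes[partitioner(key) % n_partitions] += 1
    return sizes
```

Note that Python string hashing is randomized across interpreter runs; within one Spark job that is fine, since all that matters is that equal keys hash equally.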
The shuffle is Spark's mechanism for redistributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation. So, designing a partition function that avoids the shuffle will improve performance significantly.

Input format: (we will use the following command to execute your code)
Python: /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --executor-memory 4G --driver-memory 4G task2.py
Scala: spark-submit --class task2 --executor-memory 4G --driver-memory 4G hw1.jar

Output format:
A. The output for the number of partitions and the execution time will be a number. The output for the number of items per partition will be a list of numbers.
B. You need to write the results in a JSON file. You must use exactly the same tags.
C. Do not round off the execution times.

Figure 3: JSON output structure for task2

4.3 Task3: Exploration on Multiple Datasets (2 points)
In task 3, you are asked to explore two datasets together, containing review information (test_review.json) and business information (business.json), and write a program to answer the following questions:
A. What are the average stars for each city? (1 point)
1. (DO NOT use the stars information in the business file).
2. (DO NOT discard records with an empty "city" field prior to aggregation; this just means that you should not worry about performing any error handling, input data cleanup, or handling edge-case scenarios).
3. (DO NOT round off the average stars).
B. You are required to compare the execution time of two methods for printing the top 10 cities with the highest average stars. Please note that Task 3(B) is not graded for its outcome; you will get full points as long as you implement the logic to generate the output file required for this task.
1. To evaluate the execution time, start tracking the execution time from the point you load the file.
For M1: execution time = loading time + time to create and collect averages, sort using Python, and print the first 10 cities.
For M2: execution time = loading time + time to create and collect averages, sort using Spark, and print the first 10 cities.
The loading time will be the same for both methods; the idea is to compare the overall execution time of both methods and understand which one is more efficient for an end-to-end solution. Please note that for Method 1, only the sorting is to be done in Python; creating and collecting the averages must be done via RDD. You should store the execution times in the JSON file with the tags "m1" and "m2".
2. Additionally, add a "reason" field and provide a hard-coded explanation for the observed execution times.
3. Do not round off the execution times.

Input format: (we will use the following command to execute your code)
Python: /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --executor-memory 4G --driver-memory 4G task3.py
Scala: spark-submit --class task3 --executor-memory 4G --driver-memory 4G hw1.jar

Output format:
a. You need to write the results for Question A as a text file. The header (first line) of the file is "city,stars". The outputs should be sorted by the average stars in descending order. If two cities have the same stars, sort the cities in lexicographical order (see Figure 3, left).
b. You also need to write the answer for Question B in a JSON file. You must use exactly the same tags for the task.

Figure 3: Question A output file structure (left) and JSON output structure (right) for task3

5. Grading Criteria (% penalty = % penalty of possible points you get)
1. You can use your free 5-day extension separately or together: https://forms.gle/gs5eDtjd1q18nGEx5. This form will record the number of late days you use for each assignment. We will not count late days if no request is submitted. There will be a 10% bonus if you use both Scala and Python and get the expected results.
2.
We will combine all the code we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. If plagiarism is detected, there will be no points for the entire assignment, and we will report all detected plagiarism.
3. All submissions will be graded on Vocareum. Please strictly follow the format provided; otherwise you cannot get the points even if the answer is correct. You are encouraged to try out your code on the Vocareum terminal.
4. We will grade both the correctness and the efficiency of your implementation. Efficiency is evaluated by processing time and memory usage. The maximum memory allowed is 4G, and the maximum processing time for grading is 1800s. The datasets used for grading are larger than the ones you use for doing the assignment. You will get a *% penalty if your implementation cannot generate correct outputs for large files using 4G of memory within 1800s, so please make sure your implementation is efficient enough to process large files.
5. Regrading policy: We can regrade your assignments within seven days once the scores are released. Regrading requests will not be accepted after one week.
6. There will be a 20% penalty for late submission within a week, and no points after a week. If you use your late days, there will not be a 20% penalty.
7. The bonus for using Scala will be awarded only when your Python results are correct. There is no partial credit for Scala. See the example below:

Example situations:
Task | Score for Python | Score for Scala (10% of previous column if correct) | Total
Task1 | Correct: 3 points | Correct: 3 * 10% | 3.3
Task1 | Wrong: 0 points | Correct: 0 * 10% | 0.0
Task1 | Partially correct: 1.5 points | Correct: 1.5 * 10% | 1.65
Task1 | Partially correct: 1.5 points | Wrong: 0 | 1.5

6. Common problems causing failed submissions on Vocareum / FAQ (if your program runs successfully on your local machine but fails on Vocareum, please check these):
1.
Try your program on the Vocareum terminal. Remember to set the Python version to python3.6, and use the latest Spark.
2. Check the input command line formats.
3. Check the output formats, for example the headers, tags, and typos.
4. Check the requirements for sorting the results.
5. Your program scripts should be named task1.py, task2.py, etc.
6. Check whether your local environment fits the assignment description, i.e., version and configuration.
7. If you implement the core part in plain Python instead of Spark, or implement it with high time complexity (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly.
8. You are required to use only Spark RDD in order to understand Spark operations more deeply. You will not get any points if you use Spark DataFrame or DataSet. Do not import sparksql.
9. Do not use Vocareum for debugging purposes; please debug on your local machine. Vocareum can be very slow if you use it for debugging.
10. Vocareum is reliable for checking the input and output formats, but its ability to check code correctness is limited. It cannot guarantee the correctness of the code even with a full score in the submission report.
11. Some students encounter an error like "the output rate … has exceeded the allowed value … bytes/s; attempting to kill the process". To resolve this, remove all print statements and set the Spark logging level so that it limits the logs generated, which can be done using sc.setLogLevel. Preferably, set the log level to WARN or ERROR when submitting your code.

7. Running Spark on Vocareum
We are going to use Spark 3.1.2 and Scala 2.12 for the assignments and the competition project. Here are the things you need to do on Vocareum and on your local machine to run the latest Spark and Scala:
On Vocareum:
1. Please select JDK 8 by running the command "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64"
2.
Please use the spark-submit command as "/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit"
On your local machine:
1. Please download and set up spark-3.1.2-bin-hadoop3.2; the setup steps should be the same as for spark-2.4.4.
2. If you use Scala, please update Scala's version to 2.12 in IntelliJ.

8. Tutorials for Spark Installation
Here are some useful links to help you get started with the Spark installation.
Tutorial for Ubuntu: https://phoenixnap.com/kb/install-spark-on-ubuntu
Tutorial for Windows: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c
Windows installation without Anaconda (recommended): https://phoenixnap.com/kb/install-spark-on-windows-10
Tutorial for Mac: https://medium.com/beeranddiapers/installing-apache-spark-on-mac-os-ce416007d79f
Tutorial for Linux systems: https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
Tutorial for using IntelliJ: https://medium.com/@Sushil_Kumar/setting-up-spark-with-scala-development-environment-using-intellij-idea-b22644f73ef1
Tutorial for Jupyter notebook on Windows: https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/
Spark 3.1.2 installation: https://archive.apache.org/dist/spark/spark-3.1.2/
In this competition project, you need to improve the performance of your recommendation system from Assignment 3. You can use any method (such as a hybrid recommendation system) to improve the prediction accuracy and efficiency.

2. Competition Requirements
2.1 Programming Language and Library Requirements
a. You must use Python to implement the competition project. You can use any external Python libraries as long as they are available on Vocareum.
b. You are required to use only the Spark RDD to understand Spark operations. You will not receive any points if you use Spark DataFrame or DataSet. However, if an external Python library requires a separate data structure, you may use it to load the data into the library; just make sure to do all data pre/post-processing with a Spark RDD.

2.2 Programming Environment
Python 3.6, Scala 2.12, JDK 1.8, and Spark 3.1.2. We will use these versions to compile and test your code. There will be a 20% penalty if we cannot run your code due to library version inconsistency.

2.3 Write your own code
Do not share your code with other students!! We will combine all the code we can find from the web (e.g., GitHub) as well as other students' code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

In this competition, the datasets you are going to use are from:
https://drive.google.com/drive/folders/1SIlY40owpVcGXJw3xeXk76afCwtSUx11?usp=sharing
We generated the following two datasets from the original Yelp review dataset with some filters. We randomly took 60% of the data as the training dataset, 20% as the validation dataset, and 20% as the testing dataset.
A. yelp_train.csv: the training data, which only includes the columns user_id, business_id, and stars.
B. yelp_val.csv: the validation data, in the same format as the training data.
C. We are not sharing the test dataset.
D.
other datasets: these provide additional information (such as the average stars or the location of a business):
a. review_train.json: review data only for the training pairs (user, business)
b. user.json: all user metadata
c. business.json: all business metadata, including locations, attributes, and categories
d. checkin.json: user check-ins for individual businesses
e. tip.json: tips (short reviews) written by a user about a business
f. photo.json: photo data, including captions and classifications

In the competition, you need to build a recommendation system to predict ratings for the given (user, business) pairs. You can mine interesting and useful information from the datasets provided in the Google Drive folder to support your recommendation system. You must improve your recommendation system from homework assignment 3 in terms of accuracy. You can utilize the validation dataset (yelp_val.csv) to evaluate the accuracy of your recommendation system. There are two ways to evaluate it:

(1) Error Distribution: You can compare your results to the corresponding ground truth and compute the absolute differences, then divide the absolute differences into 5 levels and count the number of predictions at each level, as follows:
>=0 and <1: 12345
>=1 and <2: …
>=2 and <3: …
>=3 and <4: …
>=4: 12
This means that there are 12345 predictions with < 1 difference from the ground truth. This way you will know the error distribution of your predictions and can improve the performance of your recommendation system.

(2) RMSE: You can compute the RMSE (Root Mean Squared Error) with the following formula:
RMSE = sqrt( (1/n) * Σ_i (Pred_i - Rate_i)^2 )
where Pred_i is the prediction for business i, Rate_i is the true rating for business i, and n is the total number of businesses you are predicting.
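Both evaluation options can be computed with a few lines of standard Python; this is a sketch with names of our choosing, operating on parallel lists of predictions and ground-truth ratings.

```python
import math

def error_distribution(preds, truth):
    """Count absolute differences in the five levels described above."""
    levels = {">=0 and <1": 0, ">=1 and <2": 0, ">=2 and <3": 0,
              ">=3 and <4": 0, ">=4": 0}
    for p, t in zip(preds, truth):
        diff = abs(p - t)
        if diff < 1:
            levels[">=0 and <1"] += 1
        elif diff < 2:
            levels[">=1 and <2"] += 1
        elif diff < 3:
            levels[">=2 and <3"] += 1
        elif diff < 4:
            levels[">=3 and <4"] += 1
        else:
            levels[">=4"] += 1
    return levels

def rmse(preds, truth):
    """RMSE = sqrt(mean((Pred_i - Rate_i)^2)) over the n predictions."""
    n = len(preds)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / n)
```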
Input format: (we will use the following command to execute your code)
/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit competition.py
Param folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder
Param test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param output_file_name: the name of the prediction result file, including the file path

Output format:
a. The output file is a CSV file containing all the prediction results for each user and business pair in the validation/testing data. The header is "user_id, business_id, prediction". There is no requirement on the order of the rows, and no requirement on the number of decimals for the prediction values. Please refer to the format in Figure 1.

Figure 1: Output example in CSV

b. You also need to write comments that include a description of your method (less than 300 words) in the first part of your program. The description should explain the models you are using, especially the way you improved the accuracy or efficiency of the system; we look forward to seeing creative methods. Please also report the error distribution, RMSE, and total execution time on the validation dataset in the description. Figure 2 shows an example of the description. If the comments are not included, or are not informative, there will be a one-point penalty.

Figure 2: An example of the description file

Grading: We will compare your prediction results against the ground truth. We will use our testing data to evaluate your recommendation system and grade based on accuracy using RMSE. To get full points for the competition project, your RMSE should beat that of the TAs', which is 0.9800 on the testing data.
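A minimal writer for the prediction file described above, assuming the header is written verbatim with the spaces shown in the assignment:

```python
import csv

def write_predictions(path, predictions):
    """predictions: iterable of (user_id, business_id, prediction) tuples.
    Rows may be in any order; no rounding is applied to the predictions."""
    with open(path, "w", newline="") as f:
        f.write("user_id, business_id, prediction\n")
        writer = csv.writer(f)
        for row in predictions:
            writer.writerow(row)
```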
If your recommendation system only beats 0.9800 on the validation data, you will receive 50% of the points for the competition. The final submission with the highest accuracy will receive an extra 6 points on the final grade, the second place an extra 5 points, the third an extra 4 points, and so on until the sixth place, which will receive an extra 1 point. To make this more like a competition, you can see a "Leaderboard" button under "Competition" on Vocareum. Every time you submit your code, your RMSE on the validation data will be scored and shown on the leaderboard. You will have the option to choose your display name on the leaderboard. Partial credit will be given if your RMSE on the testing data does not reach the threshold: if your homework 3 RMSE is x, your competition RMSE is y, and you do not meet the threshold, you will get (1 - (y - 0.98)/(x - 0.98)) * the total score of the competition.

5. Submission
You need to submit your Python script on Vocareum with exactly this name:
● competition.py

6. Grading Criteria (% penalty = % penalty of possible points you get)
1. You cannot use the extension for the competition. No late submissions will be accepted.
2. We will combine all the code we can find from the web (e.g., Github) as well as other students' code from this and other (previous) sections for plagiarism detection. If plagiarism is detected, you will receive no points for the entire assignment, and we will report all detected plagiarism.
3. All submissions will be graded on Vocareum. Please strictly follow the format provided; otherwise you will not receive points even if the answer is correct.
4. Do NOT use Spark DataFrame, DataSet, or sparksql.
5. We will not conduct regrades on competition submissions.
6. There will be no points awarded if the total execution time exceeds 25 minutes.
7.
Common problems causing failed submissions on Vocareum / FAQ (if your program seems to run successfully on your local machine but fails on Vocareum, please check these):
1. Try your program on the Vocareum terminal. Remember to set the Python version to python3.6, and use the latest Spark: /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit
2. Check the input command line format.
3. Check the output format, for example the header, tags, and typos.
4. Your Python script should be named competition.py.
5. Check whether your local environment fits the assignment description, i.e., version and configuration.
6. If you implement the core part in plain Python instead of Spark, or implement it in a high-time-complexity way (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly.
In this problem we will inference an SSD ONNX model using ONNX Runtime Server. You will follow the GitHub repo and the ONNX tutorials (links provided below). You will start with a pretrained PyTorch SSD model and retrain it for your target categories. Then you will convert this PyTorch model to ONNX and deploy it on ONNX Runtime Server for inferencing.
1. Download the pretrained PyTorch MobileNetV1 SSD and test it locally using the Pascal VOC 2007 dataset. Show the test accuracy for the 20 classes. (4)
2. Select any two related categories from the Google Open Images dataset and finetune the pretrained SSD model. Examples include Aircraft and Aeroplane, or Handgun and Shotgun. You can use the open_images_downloader.py script provided in the GitHub repo to download the data. For finetuning you can use the same parameters as in the tutorial below. Compute the accuracy on the test data for these categories before and after finetuning. (5+5)
3. Convert the PyTorch model to ONNX format and save it. (4)
4. Visualize the model using the net drawer tool. Compile the model using the embed_docstring flag and show the visualization output. Also show the doc string (stack trace for PyTorch) for different types of nodes. (6)
5. Deploy the ONNX model on the ONNX Runtime (ORT) server. You need to set up the environment following the steps listed in the tutorial. Then you need to make an HTTP request to the ORT server. Test the inferencing set-up using 1 image from each of the two selected categories. (6)
6. Parse the response message from the ORT server and annotate the two images. Show the inferencing output (bounding boxes with labels) for the two images. (5)
For parts 1, 2, and 3, refer to the steps in the GitHub repo. For part 4 refer to the ONNX tutorial on visualizing, and for parts 5 and 6 refer to the ONNX tutorial on inferencing.
References
• GitHub repo. Single Shot MultiBox Detector Implementation in Pytorch. Available at https://github.com/qfgaohao/pytorch-ssd
• ONNX tutorial. Visualizing an ONNX Model.
Available at https://github.com/onnx/tutorials/blob/master/tutorials/VisualizingAModel.md
• ONNX tutorial. Inferencing SSD ONNX model using ONNX Runtime Server. Available at https://github.com/onnx/tutorials/blob/master/tutorials/OnnxRuntimeServerSSDModel.ipynb
• Google. Open Images Dataset V5 + Extensions. Available at https://storage.googleapis.com/openimages/web/index.html
• The PASCAL Visual Object Classes Challenge 2007. Available at http://host.robots.ox.ac.uk/pascal/VOC/voc2007/

In this question you will analyze different ML cloud platforms and compare their service offerings. In particular, you will consider the ML cloud offerings from IBM, Google, Microsoft, and Amazon and compare them on the basis of the following criteria:
1. Frameworks: DL framework(s) supported and their versions. Here we are referring to machine learning platforms that have their own built-in images for different frameworks. (4)
2. Compute units: type(s) of compute units offered, i.e., GPU types. (2)
3. Model lifecycle management: tools supported to manage the ML model lifecycle. (2)
4. Monitoring: availability of application logs and resource (GPU, CPU, memory) usage monitoring data to the user. (2)
5. Visualization during training: performance metrics like accuracy and throughput. (2)
6. Elastic scaling: support for elastically scaling the compute resources of an ongoing job. (2)
7. Training job description: training job description file format. Show how the same training job is specified on the different ML platforms. Identify similar fields in the training job files of the 4 ML platforms through an example. (6)

In this problem we will follow the Kubeflow-Kale codelab (link below). You will follow the steps as outlined in the codelab to install Kubeflow with MiniKF, convert a Jupyter Notebook to Kubeflow Pipelines, and run Kubeflow Pipelines from inside a Notebook. For each step below you need to show the commands executed, the terminal output, and a screenshot of the visual output (if any).
You also need to give a new name to your GCP project and any resource instance you create, e.g., put your initials in the name string.
1. Setting up the environment and installing MiniKF: Follow the steps in the codelab to:
(a) Set up a GCP project. (2)
(b) Install MiniKF and deploy your MiniKF instance. (3)
(c) Log in to MiniKF, Kubeflow, and Rok. (3)
2. Run a pipeline from inside your Notebook: Follow the steps in the codelab to:
(a) Create a notebook server. (3)
(b) Download and run the notebook: We will be using the pytorch-classification notebook from the examples repo. Note that the codelab uses a different example from the repo (titanic dataset ml.ipynb). (4)
(c) Convert your notebook to a Kubeflow Pipeline: Enable Kale, then compile and run the pipeline from the Kale Deployment Panel. Show output from each of the 5 steps of the pipeline. (5)
(d) Show snapshots of the "Graph" and "Run output" of the experiment. (4)
(e) Cleanup: Destroy the MiniKF VM. (1)
References
• Codelab. From Notebook to Kubeflow Pipelines with MiniKF and Kale. Available at https://codelabs.developers.google.com/codelabs/cloud-kubeflow-minikf-kale
• https://github.com/kubeflow-kale/examples

This question is based on the Deep RL concepts discussed in Lecture 8. You need to refer to the papers by Mnih et al., Nair et al., and Horgan et al. to answer this question. All papers are linked below.
1. Explain the difference between episodic and continuous tasks. Give an example of each. (2)
2. What do the terms exploration and exploitation mean in RL? Why do the actors employ an ε-greedy policy for selecting actions at each step? Should ε remain fixed or follow a schedule during Deep RL training? How does the value of ε help balance exploration and exploitation during training? (1+1+1+1)
3. How is the Deep Q-Learning algorithm different from Q-Learning? Follow the steps of the Deep Q-Learning algorithm in Mnih et al. (2013), page 5, and explain each step in your own words. (3)
4.
What is the benefit of having a target Q-network? (3)
5. How does experience replay help in efficient Q-learning? (3)
6. What is prioritized experience replay? (2)
7. Compare and contrast GORILA (General Reinforcement Learning Architecture) and the Ape-X architecture. Provide three similarities and three differences. (3)
References
• Mnih et al. Playing Atari with Deep Reinforcement Learning. 2013. Available at https://arxiv.org/pdf/1312.5602.pdf
• Nair et al. Massively Parallel Methods for Deep Reinforcement Learning. 2015. Available at https://arxiv.org/pdf/1507.04296.pdf
• Horgan et al. Distributed Prioritized Experience Replay. 2018. Available at https://arxiv.org/pdf/1803.00933.pdf
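The ε-greedy selection asked about in part 2 can be sketched in a few lines of Python. This is a generic illustration, not code from the papers; the linear annealing schedule (1.0 down to 0.1 over a fixed number of steps) mirrors the kind of schedule used in Mnih et al., but the exact numbers and function names here are our own placeholders:

```python
import random

# With probability eps, explore (pick a random action); otherwise exploit
# (pick the action with the highest Q-value). The list-based Q representation
# is an illustration only.
def epsilon_greedy(q_values, eps, rng=random):
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# A linear annealing schedule: eps starts at `start`, decays to `end` over
# `anneal_steps` steps, then stays at `end`. Early training explores heavily;
# later training mostly exploits the learned Q-values.
def linear_epsilon(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```

With eps = 0 the policy is purely greedy, and with eps = 1 it is purely random, which is one way to see how ε trades off exploitation against exploration.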
In this problem we will train a convolutional neural network for image classification using transfer learning. Transfer learning involves training a base network from scratch on a very large dataset (e.g., ImageNet-1K with 1.2M images and 1K categories) and then using this base network either as a feature extractor or as an initialization network for the target task. The two major transfer learning scenarios are as follows:
• Finetuning the base model: Instead of random initialization, we initialize the network with a pretrained network, like one trained on the ImageNet dataset. The rest of the training looks as usual; however, the learning rate schedule for transfer learning may be different.
• Base model as a fixed feature extractor: Here, we freeze the weights of the entire network except those of the final fully connected layer. This last fully connected layer is replaced with a new one with random weights, and only this layer is trained.
1. For finetuning you will select a target dataset from the Visual Decathlon challenge. Its website (link below) has several datasets which you can download. Select any one of the Visual Decathlon datasets and make it your target dataset for transfer learning. Important: Do not select ImageNet-1K as the target dataset.
(a) Finetuning: You will first load a pretrained model (ResNet50) and change the final fully connected layer's output to the number of classes in the target dataset. Describe your target dataset's features, number of classes, and distribution of images per class (i.e., number of images per class). Show any 4 sample images (belonging to 2 different classes) from your target dataset. (2+2)
(b) First finetune by setting the same values of the hyperparameters (learning rate = 0.001, momentum = 0.9) for all the layers. Keep a batch size of 64 and train for 200-300 epochs or until the model converges well. You will use a multi-step learning rate schedule and decay by a factor of 0.1 (γ = 0.1 in the link below).
You can choose the steps at which you want to decay the learning rate, but do 3 drops during the training. So the first drop will bring the learning rate down to 0.0001, the second to 0.00001, and the third to 0.000001. For example, if training for 200 epochs, the first drop can happen at epoch 60, the second at epoch 120, and the third at epoch 180. (8)
(c) Next, keeping all the other hyperparameters the same as before, change the learning rate to 0.01 and then 0.1 uniformly for all the layers; that is, keep all the layers at the same learning rate. So you will be doing two experiments, one with the learning rate of all layers at 0.01 and one at 0.1. Again finetune the model and report the final accuracy. How do the accuracies with the three learning rates compare? Which learning rate gives you the best accuracy on the target dataset? (6)
2. When using a pretrained model as a feature extractor, all the layers of the network are frozen except the final layer. Thus, except for the last layer, none of the inner layers' gradients are updated during the backward pass on the target dataset. Since gradients do not need to be computed for most of the network, this is faster than finetuning.
(a) Now train only the last layer with learning rates of 1, 0.1, 0.01, and 0.001 while keeping all the other hyperparameters and settings the same as for finetuning. Which learning rate gives you the best accuracy on the target dataset? (8)
(b) For your target dataset, find the best final accuracy (across all the learning rates) from the two transfer learning approaches. Which approach and learning rate is the winner? Provide a plausible explanation to support your observation. (4)
For this problem the following resources will be helpful.
References
• PyTorch blog. Transfer Learning for Computer Vision Tutorial by S. Chilamkurthy. Available at https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
• Notes on Transfer Learning. CS231n Convolutional Neural Networks for Visual Recognition.
Available at https://cs231n.github.io/transfer-learning/
• Visual Domain Decathlon

This problem is based on two papers, by Mahajan et al. on weakly supervised pretraining and by Yalniz et al. on semi-supervised learning for image classification. Both of these papers are from Facebook and used 1B images with hashtags. Read the two papers thoroughly and then answer the following questions. You can discuss these papers with your classmates if this helps in clarifying your doubts and improving your understanding. However, no sharing of answers is permitted, and all the questions should be answered individually in your own words.
1. Both papers use the same 1B-image dataset. However, one does weakly supervised pretraining while the other does semi-supervised pretraining. What is the difference between weakly supervised and semi-supervised pretraining? How do they use the same dataset to do two different types of pretraining? Explain. (2)
2. These questions are based on the paper by Mahajan et al.
(a) Are the models trained using hashtags robust against noise in the labels? What experiments were done in the paper to study this, and what was the finding? Provide numbers from the paper to support your answer. (2)
(b) Why is resampling of the hashtag distribution important during pretraining for transfer learning? (2)
3. These questions are based on the paper by Yalniz et al.
(a) Why are there two models, a teacher and a student, and how does the student model leverage the teacher model? Explain why teacher-student modeling is a type of distillation technique. (2+2)
(b) What are the parameters K and P in stage 2 of the approach, where unlabeled images are assigned classes using the teacher network? What was the idea behind taking P > 1? Explain in your own words. (2+2)
(c) Explain how a new labeled dataset is created using unlabeled images. Can an image in this new dataset belong to more than one class? Explain. (2+2)
(d) Refer to Figure 5 in the paper.
Why does the accuracy of the student model first improve as we increase the value of K and then decrease? (2)
References
• Yalniz et al. Billion-scale semi-supervised learning for image classification. Available at https://arxiv.org/pdf/1905.00546.pdf
• Mahajan et al. Exploring the Limits of Weakly Supervised Pretraining. Available at https://arxiv.org/pdf/1805.00932.pdf

This question is based on modeling the execution time of deep learning networks by calculating the floating point operations required at each layer. We looked at two papers in class, one by Lu et al. and the other by Qi et al.
1. Why is achieving peak FLOPS on hardware devices like GPUs a difficult proposition in real systems? How does PPP help capture this inefficiency in the Paleo model? (4)
2. Lu et al. showed that the FLOPs consumed by the convolution layers in VGG16 account for about 99% of the total FLOPs in the forward pass. We will do a similar analysis for VGG19. Calculate the FLOPs for the different layers in VGG19 and then calculate the fraction of the total FLOPs attributable to the convolution layers. (6)
3. Study the tables showing timing benchmarks for Alexnet (Table 2), VGG16 (Table 3), Googlenet (Table 5), and Resnet50 (Table 6). Why did the measured time and the sum of layerwise timings for the forward pass not match on GPUs? What approach was adopted in Sec. 5 of the paper to mitigate the measurement overhead on GPUs? (2+2)
4. In Lu et al., FLOPs for the different layers of a DNN are calculated. Use the FLOPs numbers for VGG16 (Table 3), Googlenet (Table 5), and Resnet50 (Table 6) to calculate the inference time (time for a forward pass with one image) using the published TFLOPS number for the K80 (refer to NVIDIA TESLA GPU Accelerators). Use this to calculate the peak (theoretical) throughput achievable with the K80 for these 3 models. (6)
References
• Qi et al. PALEO: A Performance Model for Deep Neural Networks. ICLR 2017. Available at https://openreview.net/pdf?id=SyVVJ85lg
• Lu et al.
Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices. 2017. Available at https://arxiv.org/pdf/1709.09503.pdf

Peng et al. proposed the Optimus scheduler for deep learning clusters, which makes use of a predictive model to estimate the remaining time of a training job. Optimus assumes a parameter-server architecture for distributed training, where synchronization between the parameter server(s) and the workers happens after every training step. The time taken to complete one training step on a worker includes the time for doing forward propagation (i.e., loss computation) and backward propagation (i.e., gradient computation) at the worker, the worker pushing gradients to the parameter servers, the parameter servers updating parameters, and the worker pulling updated parameters from the parameter servers, plus extra communication overhead.

The predictive model proposed in Optimus is based on two sub-models: one to model the training loss as a function of the number of steps, and the other to model the training speed (training steps per unit time) as a function of resources (number of workers and parameter servers). The training loss model is given by Equation (1) in the paper. It has three parameters, β0, β1, and β2, that need to be estimated from the data.
1. The first step is to generate data for predictive model calibration. You will train Resnet models with different numbers of layers (18, 20, 32, 44, 56), each on 3 different GPU types (K80, P100, V100). For these runs you will use CIFAR10, a batch size of 128, and run each job for 350 epochs. You need to collect training logs containing data on the training loss and step number for each configuration. The data collection can be done in a group of up to 5 students. If working as a group, each student should pick one of the 5 Resnet models and train it on all three GPU types, so each student in the group will be contributing training data from 3 experiments.
If you decide to collaborate on the data collection, please clearly mention the names of the students involved in your submission. For each of these 15 experiments, use all the training data and calibrate a training loss model. You will report 15 models, one for each experimental configuration, and their corresponding parameters (β0, β1, β2). (15)
2. We next study how the learned parameters β0, β1, and β2 change with the type of GPU and the size of the network. Use a regression model on the data from the 15 models to predict the value of these parameters as a function of the number of layers in Resnet and the GPU type. From these regression models, predict the training loss curve for Resnet-50. Note that we are effectively doing prediction for a predictive model. To verify how good this prediction is, you will train Resnet-50 on a K80, P100, and V100 for a target accuracy of 92% and compare the predicted loss curve with the real measurements. Show this comparison in a graph and calculate the percentage error. From the predicted loss curve, get the number of epochs needed to achieve 92% accuracy. Observe that there are three curves for three different GPU types, but the number of epochs required to reach a particular accuracy (convergence rate) should be independent of hardware. (8)
3. Using the predicted number of epochs for Resnet-50 along with the resource-speed model (use Equation (4) in Peng et al. along with its coefficients from the paper), obtain the time-to-accuracy of Resnet-50 (to reach 92% accuracy) in two different settings (with 2 and 4 parameter servers, respectively) as a function of the number of workers. So you will be plotting two curves, one for the 2 and one for the 4 parameter-server case. Each curve will show how the time to achieve 92% accuracy (on the y-axis) scales with the number of workers (on the x-axis). (7)
References
• Peng et al.
Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. Available at https://i.cs.hku.hk/~cwu/papers/yhpeng-eurosys18.pdf
Notes
• In 5.2, other than the number of ResNet layers being different, every other hyperparameter should be the same during the data collection process across the different students in a group (i.e., learning rate, optimizer, preprocessing/normalization method, etc.). You should also use the SGD optimizer, since it is one of the key assumptions made by Peng et al.
• When determining the βs, you can use scipy's curve_fit function for regression based on k (effective step number) and l (training loss).
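A minimal sketch of the β-calibration suggested in the note above, using scipy's curve_fit on synthetic data. The functional form written in loss_model is our reading of Equation (1) in Peng et al. (l(k) ≈ 1/(β0·k + β1) + β2); verify it against the paper, and replace the synthetic (k, l) arrays with the step/loss columns from your real training logs:

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed form of the Optimus training-loss model (check against Eq. (1)
# in Peng et al. before using): loss decays with effective step number k
# toward an asymptote b2.
def loss_model(k, b0, b1, b2):
    return 1.0 / (b0 * k + b1) + b2

# Synthetic stand-in for a training log: generate a loss curve from known
# betas, then recover them with curve_fit.
k = np.arange(1, 500, dtype=float)   # effective step numbers
true_betas = (0.05, 2.0, 0.1)
l = loss_model(k, *true_betas)

betas, _ = curve_fit(loss_model, k, l, p0=(0.01, 1.0, 0.0))
```

On real (noisy) logs the recovered βs will only approximate the generating process, and a sensible initial guess p0 helps the fit converge.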
We will consider five methods (AdaGrad, RMSProp, RMSProp+Nesterov, AdaDelta, and Adam) and study their convergence using the CIFAR-10 dataset. We will use a multi-layer neural network model with two fully connected hidden layers of 1000 hidden units each, ReLU activation, and a minibatch size of 128.
1. Write the weight update equations for the five adaptive learning rate methods. Explain each term clearly. What are the hyperparameters in each method? Explain how AdaDelta and Adam differ from RMSProp. (5+1)
2. Train the neural network using all five methods with L2 regularization for 200 epochs each and plot the training loss vs. the number of epochs. Which method performs best (lowest training loss)? (5)
3. Add dropout (probability 0.2 for the input layer and 0.5 for the hidden layers) and train the neural network again using all five methods for 200 epochs. Compare the training loss with that in part 2. Which method performs best? For the five methods, compare their training time (to finish 200 epochs with dropout) to the training time in part 2 (to finish 200 epochs without dropout). (5)
4. Compare the test accuracy of the trained models for all five methods from parts 2 and 3. Note that to calculate the test accuracy of a model trained using dropout you need to appropriately scale the weights (by the dropout probability). (4)
References:
• The CIFAR-10 Dataset.

In this problem we will compare strong scaling and weak scaling in distributed training using tf.distribute.Strategy in TensorFlow 2.0. tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs.
In strong scaling, each worker computes with (batch size / # workers) training examples, whereas in weak scaling, the effective batch size of SGD grows as the number of workers increases. For example, in strong scaling, if the batch size with 1 worker is 256, with 2 workers it will be 128 per worker, and with 4 workers it will be 64 per worker, thus keeping the effective batch size at 256. In weak scaling, if the batch size with 1 worker is 64, with 2 workers it will still be 64 per worker (an effective batch size of 128 with 2 workers); thus the effective batch size increases linearly with the number of workers. So the amount of compute per worker decreases in strong scaling, whereas with weak scaling it remains constant.

Using the FashionMNIST dataset and Resnet50, you will run distributed training using tf.distribute.Strategy and compare the strong and weak scaling scenarios. Using an effective batch size of 256, you will run training jobs with 1, 2, 4, 8, and 16 learners (each learner is a K80 GPU). For 8 or fewer learners, all the GPUs can be allocated on the same node (GCP provides 8 K80s on one node). You will run each training job for 10 epochs and measure the average throughput, training time, and training cost. In total, you will be running 10 training jobs: 5 (with 1, 2, 4, 8, 16 GPUs) for weak scaling and 5 for strong scaling. For single-node (single-worker) training using multiple GPUs you will use tf.distribute.MirroredStrategy with the default all-reduce. For training with two or more workers you will use tf.distribute.experimental.MultiWorkerMirroredStrategy with CollectiveCommunication.AUTO.
1. Plot throughput vs. number of learners for weak and strong scaling. (5)
2. Plot training time vs. number of learners for weak and strong scaling. (5)
3. Plot training cost vs. number of learners for weak and strong scaling. The training cost can be estimated using the GPU per-hour cost and the training time. (2)
4.
For weak scaling, calculate the scaling efficiency, defined in terms of the increase in the time to finish one iteration at a learner as the number of learners increases. Show the plot of scaling efficiency vs. number of learners for weak scaling. (5)
5. MirroredStrategy uses NVIDIA NCCL (tf.distribute.NcclAllReduce) as the default all-reduce. Change this to tf.distribute.HierarchicalCopyAllReduce and tf.distribute.ReductionToOneDevice and compare the throughput of the three all-reduce implementations. You will be doing this for single-node training with 1, 2, 4, and 8 GPUs, so you will be running 8 new training jobs (4 with HierarchicalCopyAllReduce and 4 with ReductionToOneDevice). For NcclAllReduce you can reuse the results from part 1 of the question. (8)
6. Change MultiWorkerMirroredStrategy to use CollectiveCommunication.NCCL and CollectiveCommunication.RING and repeat the experiment with 2 nodes. You will be running two new training jobs (one with RING and one with NCCL). For AUTO you can reuse the throughput from part 1 of the question. Compare the throughput of the three all-reduce methods (AUTO, NCCL, RING). Does AUTO give the best throughput? (5)
References:
• Tensorflow Blog. Distributed Training with Tensorflow.

Problem 3 – Convolutional Neural Networks Architectures 30 points
In this problem we will study and compare different convolutional neural network architectures. We will calculate the number of parameters (weights to be learned) and the memory requirement of each network. We will also analyze inception modules and understand their design.
1. Calculate the number of parameters in Alexnet. You will have to show the calculations for each layer and then sum them to obtain the total number of parameters in Alexnet. When calculating, you will need to account for all the filters (size, strides, padding) at each layer. Look at Sec. 3.5 and Figure 2 in the Alexnet paper (see reference). Points will only be given when explicit calculations are shown for each layer. (5)
2. VGG (Simonyan et al.)
has an extremely homogeneous architecture that performs only 3×3 convolutions with stride 1 and pad 1 and 2×2 max pooling with stride 2 (and no padding) from beginning to end. However, VGGNet is very expensive to evaluate and uses a lot more memory and parameters. Refer to the VGG19 architecture in Table 1 on page 3 of the paper by Simonyan et al. You need to complete Table 1 below, calculating the activation units and parameters at each layer in VGG19 (without counting biases). It has been partially filled in for you. (6)
3. VGG architectures have smaller filters but deeper networks compared to Alexnet (3×3 compared to 11×11 or 5×5). Show that a stack of N convolution layers, each of filter size F × F, has the same receptive field as one convolution layer with a filter of size (NF − N + 1) × (NF − N + 1). Use this to calculate the receptive field of 3 stacked filters of size 5×5. (4)
4. The original Googlenet paper (Szegedy et al.) proposes two architectures for the Inception module, shown in Figure 2 on page 5 of the paper, referred to as naive and dimensionality reduction, respectively.
(a) What is the general idea behind designing an inception module (parallel convolutional filters of different sizes with a pooling, followed by concatenation) in a convolutional neural network?
(3)

Layer      | Number of Activations (Memory) | Parameters (Compute)
Input      | 224*224*3 = 150K               | 0
CONV3-64   | 224*224*64 = 3.2M              | (3*3*3)*64 = 1,728
CONV3-64   | 224*224*64 = 3.2M              | (3*3*64)*64 = 36,864
POOL2      | 112*112*64 = 800K              | 0
CONV3-128  |                                |
CONV3-128  |                                |
POOL2      | 56*56*128 = 400K               | 0
CONV3-256  |                                |
CONV3-256  | 56*56*256 = 800K               | (3*3*256)*256 = 589,824
CONV3-256  |                                |
CONV3-256  |                                |
POOL2      |                                | 0
CONV3-512  | 28*28*512 = 400K               | (3*3*256)*512 = 1,179,648
CONV3-512  |                                |
CONV3-512  | 28*28*512 = 400K               |
CONV3-512  |                                |
POOL2      |                                | 0
CONV3-512  |                                |
CONV3-512  |                                |
CONV3-512  |                                |
CONV3-512  |                                |
POOL2      |                                | 0
FC 4096    |                                |
FC 4096    |                                | 4096*4096 = 16,777,216
FC 1000    |                                |
TOTAL      |                                |

Table 1: VGG19 memory and weights (complete the blank cells)

(b) Assuming the input to the inception module (referred to as "previous layer" in Figure 2 of the paper) has size 32x32x256, calculate the output size after filter concatenation for the naive and dimensionality reduction inception architectures, with the number of filters given in Figure 1. (4)
(c) Next, calculate the total number of convolutional operations for each of the two inception architectures, again assuming the input to the module has dimensions 32x32x256 and the number of filters given in Figure 1. (4)
(d) Based on the calculations in part (c), explain the problem with the naive architecture and how the dimensionality reduction architecture helps (hint: compare computational complexity). How much is the computational saving? (2+2)

Figure 1: Two types of inception module, with the number of filters and input size for the calculations in Questions 3.4(b) and 3.4(c).

References:
• (Alexnet) Alex Krizhevsky et al. ImageNet Classification with Deep Convolutional Neural Networks. Paper available at https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• (VGG) Karen Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. Paper available at https://arxiv.org/pdf/1409.1556.pdf
• (Googlenet) Christian Szegedy et al. Going deeper with convolutions.
Paper available at https://arxiv.org/pdf/1409.4842.pdf

Problem 4 – Batch Augmentation, Cutout Regularization 20 points
In this problem we will achieve large-batch SGD using batch augmentation techniques. In batch augmentation, multiple instances of samples within the same batch are generated with different data augmentations. Batch augmentation acts as a regularizer and an accelerator, increasing both generalization and performance scaling. One such augmentation scheme uses Cutout regularization, where additional samples are generated by occluding random portions of an image.
1. Explain cutout regularization and its advantages compared to simple dropout (as argued in the paper by DeVries et al.) in your own words. Select any 2 images from CIFAR10 and show how these images look after applying cutout. Apply a square-shaped, fixed-size zero-mask to a random location of each image to generate its cutout version. Refer to the paper by DeVries et al. (Section 3) and the associated GitHub repository. (2+4)
2. Using the CIFAR10 dataset and Resnet-44, we will first apply simple data augmentation as in He et al. (see Section 4.2 of He et al.) and train the model with batch size 64. Note that testing is always done with the original images. Plot validation error vs. number of training epochs. (4)
3. Next, use cutout for data augmentation in Resnet-44 as in Hoffer et al., train the model, and use the same set-up in your experiments. Plot validation error vs. number of epochs for different values of M (2, 4, 8, 16, 32), where M is the number of instances generated from an input sample by applying cutout M times, effectively increasing the batch size to M·B, where B is the original batch size (before applying cutout augmentation). You will obtain a figure similar to Figure 3(a) in the paper by Hoffer et al. Also compare the number of epochs and the wallclock time to reach 94% accuracy for different values of M. Do not run any experiment for more than 100 epochs.
If even after 100 epochs of training you do not achieve 94%, then just report the accuracy you obtain and the corresponding wallclock time to train for 100 epochs. Before attempting this question it is advisable to read the paper by Hoffer et al., especially Section 4.1. (5+5)
You may reuse code from the GitHub repository associated with the work of Hoffer et al. for answering parts 2 and 3 of this question.
References:
• DeVries et al. Improved Regularization of Convolutional Neural Networks with Cutout. Paper available at https://arxiv.org/pdf/1708.04552.pdf Code available at https://github.com/uoguelph-mlrg/Cutout
• Hoffer et al. Augment your batch: better training with larger batches. 2019. Paper available at https://arxiv.org/pdf/1901.09335.pdf Code available at https://github.com/eladhoffer/convNet.pytorch/tree/master/models
• He et al. Deep residual learning for image recognition. Paper available at https://arxiv.org/abs/1512.03385

A multilayer feedforward network with as few as two layers and sufficiently many hidden units can approximate any arbitrary function. Thus one can trade off between deep and shallow networks for the same problem. In this problem we will study this tradeoff using the Eggholder function, defined as:

f(x1, x2) = −(x2 + 47) sin(√|x1/2 + (x2 + 47)|) − x1 sin(√|x1 − (x2 + 47)|)

Let y(x1, x2) = f(x1, x2) + N(0, 0.3) be the function that we want to learn with a neural network through regression, with −512 ≤ x1 ≤ 512 and −512 ≤ x2 ≤ 512. Draw a dataset of 100K points from this function (sampling uniformly over the range of x1 and x2) and do an 80/20 training/test split.
1. Assume that the total budget for the number of hidden units we can have in the network is 512. Train 1-, 2-, and 3-hidden-layer feedforward neural networks to learn the regression function. For each neural network you can consider a different number of hidden units per hidden layer, so long as the total number of hidden units does not exceed 512.
We recommend working with 16, 32, 64, 128, 256, or 512 hidden units per layer. So if there is only one hidden layer, you can have at most 512 units in that layer. If there are two hidden layers, you can have any combination of hidden units in each layer, e.g., 16 and 256, or 64 and 128, such that the total is at most 512. Plot the RMSE (root mean square error) on the test set for networks with different numbers of hidden layers as a function of the total number of hidden units. If there is more than one network with the same total number of hidden units (say, a two-hidden-layer network with 16 units in the first layer and 128 in the second, and another network with 128 in the first layer and 16 in the second), use the average RMSE. So you will have a figure with three curves, one each for the 1-, 2-, and 3-layer networks, with the x-axis being the total number of hidden units. Also plot another set of curves with the x-axis being the number of parameters (weights) that you need to learn in the network. (20)
2. Comment on the tradeoff between the number of parameters and RMSE as you go from deeper (3 hidden layers) to shallower networks (1 hidden layer). Also measure the wall clock time for training each configuration and plot training time vs. number of parameters. Do you see a similar tradeoff in training time? (10)
For networks with 2 and 3 layers, use batch normalization as regularization. For hidden layers use ReLU activation, and for training use SGD with Nesterov momentum. Take a batch size of 1000 and train for 2000 epochs. You can pick other hyperparameter values (momentum, learning rate schedule) or use the default values in the framework implementation.
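The Eggholder function and the dataset-generation step described above can be sketched in pure Python. The function and variable names, the fixed seed, and the 1000-point sample in the test are our own choices; scale n up to 100K for the actual experiment:

```python
import math
import random

# The Eggholder function as defined in the problem statement.
def eggholder(x1, x2):
    return (-(x2 + 47) * math.sin(math.sqrt(abs(x1 / 2 + (x2 + 47))))
            - x1 * math.sin(math.sqrt(abs(x1 - (x2 + 47)))))

# Noisy target y = f(x1, x2) + N(0, 0.3), sampled uniformly over the
# [-512, 512]^2 domain, followed by the 80/20 train/test split the
# problem asks for.
def make_dataset(n, noise_sd=0.3, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(-512, 512)
        x2 = rng.uniform(-512, 512)
        data.append((x1, x2, eggholder(x1, x2) + rng.gauss(0, noise_sd)))
    split = int(0.8 * n)
    return data[:split], data[split:]  # train, test
```

A useful sanity check: the Eggholder function's well-known global minimum is at (512, 404.2319) with value approximately −959.64, which a correct implementation should reproduce.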
Consider a dataset with two features x1 and x2 in which the points (−1, −1), (1, 1), (−3, −3), (4, 4) belong to one class and (−1, 1), (1, −1), (−5, 2), (4, −8) belong to the other.

1. Is this dataset linearly separable? Can a linear classifier be trained using features x1 and x2 to classify this dataset? You can plot the dataset points and argue. (2)
2. Can you define a new 1-dimensional representation z in terms of x1 and x2 such that the dataset is linearly separable in terms of the 1-dimensional representation corresponding to z? (4)
3. What does the separating hyperplane look like? (2)
4. Explain the importance of nonlinear transformations in classification problems. (2)

1. Derive the bias-variance decomposition for a regression problem, i.e., prove that the expected mean squared error of a regression problem can be written as

E[MSE] = Bias² + Variance + Noise

Hint: Let y(x) = f(x) + ε be the true (unknown) relationship and ŷ = g(x) be the model-predicted value of y. Then the MSE over test instances xi, i = 1, ..., t, is given by:

MSE = (1/t) Σ_{i=1..t} (f(xi) + ε − g(xi))² (5)

2. Consider the case y(x) = x + sin(1.5x) + N(0, 0.3); here f(x) = x + sin(1.5x) and ε = N(0, 0.3). Create a dataset of 20 points by randomly generating samples from y. Display the dataset and f(x). Use a scatter plot for y and a smooth line plot for f(x). (5)

3. Use a weighted sum of polynomials as an estimator function for f(x); in particular, let the form of the estimator be:

g_n(x) = β0 + β1 x + β2 x² + ... + βn x^n

Consider three candidate estimators, g1, g3, and g10. Estimate the coefficients of each of the three estimators using the sampled dataset and plot f(x), g1(x), g3(x), g10(x). Which estimator is underfitting? Which one is overfitting? (10)

4. Generate 100 datasets (each of size 50) by randomly sampling from y. Partition each dataset into a training and test set (80/20 split). Next fit estimators of varying complexity, i.e., g1, g2, ..., g15, using the training set of each dataset.
Then calculate and display the squared bias, variance, and error on the test set for each of the estimators, showing the tradeoff between bias and variance with model complexity. Can you identify the best model? (10)

5. One way to increase model bias is by using regularization. Let's take the order-10 polynomial and apply L2 regularization. Compare the bias, variance, and MSE of the regularized model with the unregularized order-10 polynomial model. Does the regularized model have a higher or lower bias? What about MSE? Explain. (10)

OpenML (https://www.openml.org) has thousands of datasets for classification tasks. Select any 2 datasets from OpenML with different numbers of output classes.

1. Summarize the attributes of each dataset: number of features, number of instances, number of classes, number of numerical features, number of categorical features. (5)

2. For each dataset, select 80% of the data as the training set and the remaining 20% as the test set. Generate 10 different subsets of the training set by randomly subsampling 10%, 20%, ..., 100% of the training set. Use each of these subsets to train two different classifiers: Random Forest and Gradient Boosting. When training a classifier, also measure the wall-clock time to train. After each training run, evaluate the accuracy of the trained models on the test set. Report model accuracy and training time for each of the 10 subsets of the training set. Generate a learning curve for each classifier. A learning curve shows how accuracy changes with increasing size of training data. Also create a curve showing the training time of the classifiers with increasing size of training data. So, for each dataset you will have two figures: the first showing the learning curves (x-axis being training data size and y-axis accuracy) for the two classifiers, and the second showing training time for the two classifiers as a function of training data size. (15)

3. Study the scaling of training time and accuracy of the classifiers with training data size using the two figures generated in part 2. Compare the performance of the classifiers in terms of training time and accuracy and write 3 main observations. Which gives better accuracy? Which has shorter training time? (5)

This question is based on two papers, one from ICML 2006 and the other from NIPS 2015 (details below). The ICML paper discusses the relationship between ROC and Precision-Recall (PR) curves and shows a one-to-one correspondence between them. The NIPS paper introduces Precision-Recall-Gain (PRG) curves. You will need to refer to the two papers to answer the following questions.

1. Do true negatives matter for both the ROC and the PR curve? Argue why each point on the ROC curve corresponds to a unique point on the PR curve. (5)
2. Select one OpenML dataset with 2 output classes. Use two binary classifiers (AdaBoost and Logistic Regression) and create ROC and PR curves for each of them. You will have two figures: one containing the two ROC curves and the other containing the two PR curves. Show the point where an all-positive classifier lies on the ROC and PR curves. (10)
3. The NIPS paper defines the PR-Gain curve. Calculate the AUROC (Area under ROC), AUPR (Area under PR), and AUPRG (Area under PRG) for the two classifiers and compare. Do you agree with the conclusion of the NIPS paper that practitioners should use PR-Gain curves rather than PR curves? (10)

Related papers:
• Jesse Davis, Mark Goadrich. The Relationship Between Precision-Recall and ROC Curves. ICML 2006. https://www.biostat.wisc.edu/~page/rocpr.pdf
• Peter A. Flach and Meelis Kull. Precision-Recall-Gain Curves: PR Analysis Done Right. NIPS 2015. https://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right
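For part 3, the gain transformation from the Flach and Kull paper maps precision and recall relative to the always-positive baseline π (the proportion of positives in the data). A minimal sketch of that transform (the function name is ours):

```python
def pr_gain(precision, recall, pi):
    """Precision-Recall-Gain transform (Flach & Kull, NIPS 2015).

    pi is the proportion of positives in the data. The gains are
    meaningful for precision/recall >= pi, where both lie in [0, 1].
    """
    prec_gain = (precision - pi) / ((1 - pi) * precision)
    rec_gain = (recall - pi) / ((1 - pi) * recall)
    return prec_gain, rec_gain
```

As a quick check of the geometry the paper emphasizes: the all-positive classifier has precision = π and recall = 1, which maps to the PRG point (0, 1); a perfect classifier maps to (1, 1).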
Consider a 2-dimensional dataset in which all points with x1 > x2 belong to the positive class, and all points with x1 ≤ x2 belong to the negative class. Therefore, the true separator of the two classes is the linear hyperplane (line) defined by x1 − x2 = 0.

Now create a training dataset of 20 points randomly generated inside the unit square in the positive quadrant. Label each point depending on whether or not its first coordinate x1 is greater than its second coordinate x2.

Now consider the following loss function for a training pair (X̄, y) and weight vector W̄:

L = max{0, a − y(W̄ · X̄)},

where test instances are predicted as ŷ = sign(W̄ · X̄). For this problem, W̄ = [w1, w2], X̄ = [x1, x2], and ŷ = sign(w1x1 + w2x2). A value of a = 0 corresponds to the perceptron criterion and a value of a = 1 corresponds to hinge loss.

1. Implement the perceptron algorithm without regularization, train it on the 20 points above, and test its accuracy on 1000 randomly generated points inside the unit square. Generate the test points using the same procedure as the training points. (6)
2. Change the perceptron criterion to hinge loss in your implementation for training, and repeat the accuracy computation on the same test points above. Regularization is not used. (5)
3. In which case do you obtain better accuracy, and why? (2)
4. In which case do you think the classification of the same 1000 test instances will not change significantly when using a different set of 20 training points? (2)

Read the two blogs, one by Andre Perunicic and the other by Daniel Godoy, on weight initialization. You will reuse the code in the GitHub repo linked in the blog for explaining vanishing and exploding gradients. You can use the same 5-layer neural network model as in the repo and the same dataset.

1. Explain the vanishing gradients phenomenon using normally distributed weight initialization with different values of the standard deviation, and tanh and sigmoid activation functions.
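The perceptron/hinge setup of the previous problem fits in a few lines, since both losses share the same subgradient update, differing only in the threshold a. A minimal NumPy sketch (the learning rate, epoch count, and seed below are our own illustrative choices, not specified in the problem):

```python
import numpy as np

def train(X, y, a=1.0, lr=0.1, epochs=100):
    """SGD on L = max(0, a - y * (w . x)); a=0 -> perceptron, a=1 -> hinge."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) < a:   # loss is active, subgradient is -yi*xi
                w += lr * yi * xi
    return w

def accuracy(w, X, y):
    return np.mean(np.sign(X @ w) == y)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(20, 2))
y_train = np.where(X_train[:, 0] > X_train[:, 1], 1, -1)
X_test = rng.uniform(0, 1, size=(1000, 2))
y_test = np.where(X_test[:, 0] > X_test[:, 1], 1, -1)
w = train(X_train, y_train, a=1.0)
```

Note that no bias term is needed here, because the true separator x1 − x2 = 0 passes through the origin.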
Then show how Xavier (aka Glorot normal) initialization of the weights helps in dealing with this problem. Next use ReLU activation and show that, instead of Xavier initialization, He initialization works better for ReLU activation. You can plot activations at each of the 5 layers to answer this question. (10)

2. The dying ReLU is a kind of vanishing-gradient problem in which ReLU neurons become inactive and output 0 for any input. In the worst case of dying ReLU, the ReLU neurons at a certain layer are all dead, i.e., the entire network dies; this is referred to as a dying ReLU neural network in Lu et al. (reference below). A dying ReLU neural network collapses to a constant function. Show this phenomenon using any one of the three 1-dimensional functions on page 11 of Lu et al. Use a 10-layer ReLU network with width 2 (hidden units per layer). Use a minibatch size of 64 and draw training data uniformly from [−√7, √7]. Perform 1000 independent training simulations, each with 3,000 training points. Out of these 1000 simulations, what fraction resulted in neural network collapse? Is your answer close to the over 90% reported in Lu et al.? (10)

3. Instead of ReLU, consider the Leaky ReLU activation defined below:

φ(z) = z if z > 0; 0.01z if z ≤ 0.

Run the 1000 training simulations of part 2 with Leaky ReLU activation, keeping everything else the same. Again calculate the fraction of simulations that resulted in neural network collapse. Did Leaky ReLU help in preventing dying neurons? (10)

References:
• Andre Perunicic. Understand Neural Network Weight Initialization. Available at https://intoli.com/blog/neural-network-initialization/
• Daniel Godoy. Hyper-parameters in Action! Part II — Weight Initializers.
• Initializers – Keras documentation. https://keras.io/initializers/
• Lu Lu et al. Dying ReLU and Initialization: Theory and Numerical Examples.

Batch normalization and Dropout are used as effective regularization techniques.
However, it is not clear which one should be preferred, or whether their benefits add up when used in conjunction. In this problem we will compare batch normalization, dropout, and their combination using MNIST and LeNet-5 (see, e.g., https://engmrk.com/lenet-5-a-classic-cnn-architecture/). LeNet-5 is one of the earliest convolutional neural networks developed for image classification, and implementations are available in all major frameworks. You can refer to Lecture 3 slides for the definitions of standardization and batch normalization.

1. Explain the terms co-adaptation and internal covariate shift. Use examples if needed. You may need to refer to the two papers mentioned below to answer this question. (5)

2. Batch normalization is traditionally used in hidden layers; for the input layer, standard normalization is used. In standard normalization the mean and standard deviation are calculated over the entire training dataset, whereas in batch normalization these statistics are calculated for each mini-batch. Train LeNet-5 with standard normalization of the input and batch normalization for the hidden layers. What are the learned batch-norm parameters for each layer? (5)

3. Next, instead of standard normalization, use batch normalization for the input layer as well and train the network. Plot the distribution of the learned batch-norm parameters for each layer (including the input) using violin plots. Compare the train/test accuracy and loss for the two cases. Did batch normalization for the input layer improve performance? (5)

4. Train the network without batch normalization, but this time use dropout. For hidden layers use a dropout probability of 0.5, and for the input layer take it to be 0.2. Compare the test accuracy using dropout to the test accuracy obtained using batch normalization in parts 2 and 3. (5)

5. Now train the network using both batch normalization and dropout. How does the performance (test accuracy) of the network compare with the cases of dropout alone and batch normalization alone?
(5)

References:
• N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Available at https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
• S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Available at https://arxiv.org/abs/1502.03167

Recall the cyclical learning rate policy discussed in Lecture 4. The learning rate changes in a cyclical manner between lr_min and lr_max, which are hyperparameters that need to be specified. For this problem you first need to read the article referenced below carefully, as you will be making use of the code there (in Keras) and modifying it as needed. For those who want to work in PyTorch, there are open-source implementations of this policy available which you can easily search for and build upon. You will work with the FashionMNIST dataset and MiniGoogLeNet (described in the reference).

1. Summarize the FashionMNIST dataset: total dataset size, training set size, validation set size, number of classes, number of images per class. Show any 3 representative images from any 3 classes in the dataset. (3)

2. Fix the batch size to 64, start with 10 candidate learning rates between 10^−9 and 10^1, and train your model for 5 epochs. Plot the training loss as a function of learning rate. You should see a curve like Figure 3 in the reference below. From that figure identify the values of lr_min and lr_max. (5)

3. Use the cyclical learning rate policy (with exponential decay) and train your network using batch size 64 and the lr_min and lr_max values obtained in part 2. Plot train/validation loss and accuracy curves (similar to Figure 4 in the reference). (5)

4. Fix the learning rate to lr_min and train your network starting with batch size 64 and going up to 8192. If your GPU cannot handle large batch sizes, you can employ the effective batch size approach discussed in Lecture 3 to simulate large batches.
Plot the training loss as a function of batch size. Do you see similar behavior of the training loss with respect to batch size as seen in part 2 with respect to learning rate? (5)

5. Can you identify b_min and b_max from the figure in part 4 for devising a cyclical batch size policy? Create an algorithm for automatically determining the batch size and show its steps in a block diagram as in Figure 1 of the reference. (4)

6. Use the b_min and b_max values identified in part 5 and devise a cyclical batch size policy such that the batch size changes in a cyclical manner between b_min and b_max. In part 3 we applied an exponential decrease to the learning rate as training progressed. What should the analogous trajectory for batch size be as training progresses: exponential increase or decrease? Use the cyclical batch size policy (with the appropriate trajectory) and train your network using learning rate lr_min. (6)

7. Compare the best accuracy from the two cyclical policies. Which policy gives you the best accuracy? (2)

PS: In part 3 of this problem we are doing cyclical learning rate with exponential decay. The code under "Keras Learning Rate Finder" in the blog implements the triangular policy; you may need to change it to have exponential decay as mentioned in the first reference below. For parts 4 and 6, you will be writing your own Python project, "Keras Batch Finder".

References:
1. Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. Available at https://arxiv.org/abs/1506.01186
2. Keras implementation of cyclical learning rate policy. Available at https://www.pyimagesearch.com/2019/08/05/keras-learning-rate-finder/
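The cyclical schedule itself is a small pure function, which makes it easy to adapt for the batch-size variant in parts 5 and 6. A sketch of the triangular policy from the Smith paper, with an optional per-iteration decay factor gamma for the exponentially decaying variant (the function signature is our own):

```python
import math

def cyclical_lr(it, lr_min, lr_max, step_size, gamma=1.0):
    """Triangular cyclical learning rate (Smith, 2015).

    it        -- current training iteration (0-based)
    step_size -- number of iterations in half a cycle
    gamma     -- per-iteration decay; gamma < 1 shrinks the cycle
                 amplitude exponentially (the 'exp_range' variant)
    """
    cycle = math.floor(1 + it / (2 * step_size))
    x = abs(it / step_size - 2 * cycle + 1)
    return lr_min + (lr_max - lr_min) * max(0.0, 1 - x) * gamma ** it
```

With gamma = 1 the rate ramps linearly from lr_min at iteration 0 up to lr_max at iteration step_size, back down to lr_min at 2·step_size, and repeats. A cyclical batch-size policy can reuse the same shape with b_min and b_max, rounding the result to an integer.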
Write a program that will read from a file and display to the console all the words in the file in alphabetical order, along with the number of times each word appears in the file. For example, if the input file is:

This is an example, of an input file for project four, as an example.

The output to the console would be:

an – 3
example – 2
file – 1
for – 1
four – 1
input – 1
is – 1
of – 1
project – 1
This – 1

To get the individual words from each line of the file, use a regular expression (perhaps with the split method of class String) as shown in lecture. Note that the input file may contain reasonable punctuation marks separating the words.

Use a TreeMap to store the words and their counts (that is, TreeMap<String, Integer>). You will need to use the wrapper class Integer to hold the count of the words, as TreeMaps do not store primitives.

Allow the user to select the input file using a JFileChooser. Since this project can be done with one class, you do not need to create a jar file. You can submit the file Project4.java to Blackboard. Make sure you upload the correct file by the due date (which is also the cutoff date), as there will be no opportunities for resubmission of projects.
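The project itself must be written in Java, but the split-and-count logic is language-independent. The following Python sketch illustrates the idea (the sorted dictionary here plays the role the TreeMap plays in Java; the function name is ours):

```python
import re

def word_counts(text):
    """Split on non-word characters and count occurrences, returning
    pairs sorted alphabetically (case-insensitive, as in the example)."""
    counts = {}
    for word in re.split(r"\W+", text):
        if word:  # skip empty strings produced at punctuation boundaries
            counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[0].lower())

text = "This is an example, of an input file for project four, as an example."
for word, n in word_counts(text):
    print(word, "-", n)
```

In Java, the alphabetical ordering comes for free from the TreeMap's natural key ordering, so only the splitting and count updating need to be written by hand.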
Let the player know if he or she won the game by guessing all the words on the solutions list. Show a MessageDialog in this case, and ask the user if he or she would like to play again.

Create a File Menu in your GUI. Add a file menu to your game GUI with options to open any file for reading (processing the file as in Project 2) and to Quit the program. You will need a FileMenuHandler class to handle the events from the FileMenu. Be sure to use getAbsolutePath() when getting the file from the JFileChooser, not getName().

Handle Exceptions. Create an exception called IllegalWordException (by extending IllegalArgumentException as shown in lecture) and have the constructor of the Word class throw it. A Word is illegal if it does not consist entirely of lowercase letters. Use a try/catch statement to catch this exception in your program, and show the erroneous Words in the console. A data file will be provided that has illegal words in it.

Create a jar file called Project3.jar and submit that to Blackboard by the due date for full credit. Be sure your jar file contains .java files, not .class files.
Add the following improvements to the word game: (1) The first letter of the subject letters (the first line of the input file) must be contained in all correctly guessed words. (2) If a guessed word contains ALL of the subject letters, it is worth 3 points. (3) Display the correctly guessed words in alphabetical order.

Create a class called WordNode which has instance variables for the data (a Word) and next (a WordNode). Include a one-argument constructor which takes a Word as a parameter. (For hints, see the PowerPoint on "Static vs. Dynamic Structures".)

public WordNode(Word w) { ... }

The instance variables should have protected access.

Create an abstract linked list class called WordList. This should be a linked list with a head node, as described in lecture. Modify it so that the data type in the nodes is Word. The no-argument constructor should create an empty list with first and last pointing to an empty head node, and length equal to zero. Include an append method in this class.

Create two more linked list classes that extend the abstract class WordList: one called UnsortedWordList and one called SortedWordList, each with an appropriate no-argument constructor. Each of these classes should have a method called add(Word) that adds a new node to the list. In the case of the UnsortedWordList, it adds the node to the end of the list by calling the append method in the superclass. In the case of the SortedWordList, it inserts the node in the proper position to keep the list sorted.

Instantiate two linked lists, one sorted and one unsorted. Add the solutions from the input file to the unsorted linked list. This list will be searched to see if a guessed word matches. As words are correctly guessed, add them to the sorted list and display the contents of that list in the TextArea for the guessed words. Check the method setText in class TextArea for updating the contents of the TextArea.

Submit a jar file.
Rather than upload all the files for this project separately, we will use Java's facility to create the equivalent of a zip file, known as a Java ARchive file, or "jar" file. Instructions on how to create a jar file using Eclipse are on Blackboard. Be sure to include source files (.java files), not class files, in your jar. Create a jar file called Project2.jar and submit that.
This project is loosely based on a word puzzle called the Spelling Beehive found in the Sunday New York Times magazine. In it, a player is given a set of seven letters and has to find as many words as possible using some portion, but at least five, of those seven letters. Letters may be used more than once. Each correct word earns one point.

The input file. To make a simple example, let's suppose the player is given just four letters (instead of the seven we will use for this project) and has to make words of at least three letters. The first line of the input file will be the letters to use, and the rest of the input file will contain solutions that would be hidden from the user. Here is an example:

PRTA
PART
TARP
ART
RAT
APART
TRAP
Etc.

Your program should read the first line into a String variable for the letters, and the rest of the file into an array of Strings against which the user's guesses can be matched.

Create a GUI for the puzzle with a grid layout of one row and two columns. In the left column put the puzzle letters, and in the right column display the words that the user has found so far (words the user has guessed and your program has found on the solutions list) and the user's score. Accept words from the user via a JOptionPane. MessageDialogs should be shown to the user in the following cases:

1. The user has used a letter that is not one of the seven letters given.
2. The user's guess is less than 5 letters long.
3. The user's guess is not in the solutions list.

Submitting the Project. You should have two files to submit for this project: Project1.java and PuzzleGUI.java. Upload your project files to Blackboard by the due date for full credit.
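The three dialog cases map naturally onto a small validation routine checked in order. The project itself is in Java, but the logic can be sketched compactly in Python (the function name, message strings, and the example letter set below are illustrative, not part of the spec):

```python
def check_guess(guess, letters, solutions):
    """Validate a guess against the puzzle rules described above.

    Returns a message string for the dialog to show, or None if the
    guess is valid and scores a point.
    """
    if any(ch not in letters for ch in guess):
        return "uses a letter not in the puzzle"
    if len(guess) < 5:
        return "guess must be at least 5 letters"
    if guess not in solutions:
        return "not in the solutions list"
    return None
```

Since letters may be reused, the check is per character membership, not a multiset comparison; a set of solution strings makes the final lookup constant-time.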