Programming Assignment 4

1 Goal
The goal of this project is to perform gradient domain image blending. Basically, given a source image and a mask, our goal is to seamlessly blend the masked areas of the source image into a target photograph.

2 Starter Code
Starter code (in Python) along with the images can be downloaded from here. Note that expected results for a few images are included in the "Results" folder.

3 Main Task
Here, you are required to implement gradient domain image blending, described in Perez et al.'s paper. Your implementation should run properly on all the examples (which means you need to handle the boundary conditions – see Sec. 3.3); however, you will only be able to produce high-quality results on images 1 to 5 (images 6 and 7 require Poisson blending with mixing gradients – see Extra Credit). To get full credit, you need to create one composite of your own. You can use the GetMask function to obtain a mask on your images. The implementation of gradient domain blending is not complicated, but it is easy to make a mistake. Below we provide a detailed description of the process for a simple 1D case and then extend it to 2D.

3.1 Simple 1D Example
Note: the following guide is for MATLAB, but there are similar functions in Python as well. You can find equivalent functions by searching "Similar function to MATLAB X in Python". Here, we are going to discuss a simple 1D case to get you started with implementing the 2D case.

3.1.1 Hole-Filling
Let's start with a simple case where instead of copying in new gradients we only want to fill in a missing region of an image and keep the gradients as smooth (close to zero) as possible. To simplify things further, let's start with a one-dimensional signal instead of a two-dimensional image. Here is our signal t (see Fig. 1) and a mask M specifying which "pixels" are missing.

t = [5 4 0 0 0 0 2 4];
M = [0 0 1 1 1 1 0 0];
M = logical(M); % converting the mask from double to binary

We can formulate our objective as a least squares problem. Given the intensity values of t, we want to solve for new intensity values v under the mask M such that:

v = argmin_v Σ_{i ∈ M, j ∈ N(i)} (v_i − v_j)^2,  with v_i = t_i for all i ∉ M.   (1)

Here, i is a coordinate (1D or 2D) for each pixel under mask M, and each j ∈ N(i) is a neighbor of i. The summation guides the gradient (the local pixel differences) in all directions to be close to 0. Minimizing this equation could be called a Poisson fill.

Figure 1: On the left, we show the target signal t. The goal is to fill in samples 3 to 6 smoothly using the Poisson equation. On the right, we show the filled-in result t_smoothed.

For this example let's define the neighborhood to be the pixel to your left. The optimal pixel values can be obtained by setting the derivative of the above equation equal to zero. This gives us two equations at each pixel as follows:

for all i ∈ M:  v_i − v_{i−1} = 0  and  v_{i+1} − v_i = 0,   (2)

where v_j = t_j for all j ∉ M. This produces the following system of equations:

v(1) - t(2) = 0; % left border
v(2) - v(1) = 0;
v(3) - v(2) = 0;
v(4) - v(3) = 0;
t(7) - v(4) = 0; % right border

Note that the coordinates do not directly correspond between v and t. For example, v(1), the first unknown pixel, sits on top of t(3). You could formulate it differently if you choose.
Plugging in known values of t we get:

v(1) - 4 = 0;
v(2) - v(1) = 0;
v(3) - v(2) = 0;
v(4) - v(3) = 0;
2 - v(4) = 0;

Now let's convert this to matrix form and have MATLAB solve it for us:

A = [ 1  0  0  0; ...
     -1  1  0  0; ...
      0 -1  1  0; ...
      0  0 -1  1; ...
      0  0  0 -1];
b = [4; 0; 0; 0; -2];
v = A \ b; % "help mldivide" describes the '\' operator.
t_smoothed = zeros(size(t));
t_smoothed(~M) = t(~M);
t_smoothed(M) = v;

As it turns out, in the 1D case the Poisson fill is simply a linear interpolation between the boundary values. But in 2D the Poisson fill exhibits more complexity.

3.1.2 Blending Source
Now instead of just doing a fill, let's try to seamlessly blend content from one 1D signal into another. We'll fill the missing values in t using the corresponding values in s (see Fig. 2):

s = [8 6 7 2 4 5 7 8];

Now our objective changes. Instead of trying to minimize the gradients, we want the gradients to match another set of gradients (those in s). Therefore, our set of equations changes as follows:

v(1) - t(2) = s(3) - s(2);
v(2) - v(1) = s(4) - s(3);
v(3) - v(2) = s(5) - s(4);
v(4) - v(3) = s(6) - s(5);
t(7) - v(4) = s(7) - s(6);

After plugging in known values from t and s this becomes:

v(1) - 4 = 1;
v(2) - v(1) = -5;
v(3) - v(2) = 2;
v(4) - v(3) = 1;
2 - v(4) = 2;

Finally, in matrix form for MATLAB:

A = [ 1  0  0  0; ...
     -1  1  0  0; ...
      0 -1  1  0; ...
      0  0 -1  1; ...
      0  0  0 -1];
b = [5; -5; 2; 1; 0];
v = A \ b;
t_and_s_blended = zeros(size(t));
t_and_s_blended(~M) = t(~M);
t_and_s_blended(M) = v;

Notice that in our quest to preserve gradients without regard for intensity we might have gone too far. Our signal now has negative values. The same thing can happen in the image domain, so you'll want to watch for that and at the very least clamp values back to the valid range.

Figure 2: On the left, we show the source signal s. The goal is to blend this source signal with the target from Fig. 1. On the right, we show the blended result t_and_s_blended.

3.2 Extension to 2D
When working with images, the basic idea is the same as above, except that each pixel has at least two neighbors (left and top) and possibly four neighbors. Either formulation will work. For example, in a 2D image using a 4-connected neighborhood, our equations above imply that for a single pixel in v, at coordinate (i,j), which is fully under the mask, you would have the following equations:

v(i,j) - v(i-1,j) = s(i,j) - s(i-1,j);
v(i,j) - v(i+1,j) = s(i,j) - s(i+1,j);
v(i,j) - v(i,j-1) = s(i,j) - s(i,j-1);
v(i,j) - v(i,j+1) = s(i,j) - s(i,j+1);

Summing these four equations gives the equivalent single equation:

4*v(i,j) - v(i-1,j) - v(i+1,j) - v(i,j-1) - v(i,j+1) = 4*s(i,j) - s(i-1,j) - s(i+1,j) - s(i,j-1) - s(i,j+1);

where everything on the right-hand side is known. This formulation is similar to equation 8 in Perez et al.'s paper. You can read the paper, especially the "Discrete Poisson Solver" section, if you want more guidance.

3.3 Hints
• For color images, process each color channel independently (hint: matrix A will not change, so don't go through the computational expense of rebuilding it for each color channel).
• The linear system of equations (and thus the matrix A) becomes enormous. But A is also very sparse because each equation only relates a pixel to some number of its immediate neighbors.
• You will need to keep track of the relationship between coordinates in matrix A and image coordinates. You might need a dedicated data structure to keep track of the mapping between rows and columns of A and pixels in s and t.
• Not every pixel has left, right, top, and bottom neighbors. Handling these boundary conditions might get slightly messy. You can start by assuming that all masked pixels have 4 neighbors (this is true for the majority of cases), but your code should properly handle the boundary conditions (image 7).
• Your algorithm can be made significantly faster by finding all the A matrix values and coordinates ahead of time and then constructing the sparse matrix in one operation (see the sketch after this list). This should speed up blending from minutes to seconds.
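To make the sparse-matrix hint above concrete, here is a minimal Python sketch of the interior-pixel case. It is a sketch under assumptions, not the required structure of your solution: the function and variable names are ours, it handles a single float channel in [0, 1], it assumes every masked pixel has four in-image neighbors, and it ignores the boundary handling of Sec. 3.3.

import numpy as np
import scipy.sparse
import scipy.sparse.linalg

def poisson_blend_channel(source, target, mask):
    """Blend one channel of `source` into `target` under a boolean `mask`.
    Sketch only: assumes every masked pixel has 4 in-image neighbors."""
    h, w = target.shape
    # Map each masked pixel (y, x) to a row/column index of A.
    ys, xs = np.nonzero(mask)
    index = -np.ones((h, w), dtype=int)
    index[ys, xs] = np.arange(len(ys))

    rows, cols, vals = [], [], []   # triplets for one-shot sparse construction
    b = np.zeros(len(ys))
    for k, (y, x) in enumerate(zip(ys, xs)):
        rows.append(k); cols.append(k); vals.append(4.0)
        b[k] = 4 * source[y, x]
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            b[k] -= source[ny, nx]          # source gradient goes to the right-hand side
            if mask[ny, nx]:
                rows.append(k); cols.append(index[ny, nx]); vals.append(-1.0)
            else:
                b[k] += target[ny, nx]      # known target value also moves to b
    A = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(len(ys), len(ys)))
    v = scipy.sparse.linalg.spsolve(A, b)

    result = target.copy()
    result[ys, xs] = np.clip(v, 0.0, 1.0)   # clamp, assuming images in [0, 1]
    return result

Collecting the (row, column, value) triplets first and building A once is what makes this fast; for color images you would reuse the same A and solve again with a per-channel b.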
4 Extra Credit
Mixing Gradients – Use mixing gradients to allow transparent images or images with holes. Instead of trying to adhere to the gradients of the source image, at each masked pixel use the largest gradient available in either the source or the target. You can test your implementation on images 6 and 7. Note that mixing gradients is different from averaging gradients. You won't receive any points for implementing averaged gradients.

5 Write Up
Include all the results in your report. For each result, you should show the input images, as well as the naive blending along with your blended result. Describe how you implemented the assignment. Discuss any problems you faced when implementing the assignment or any decisions you had to make.

6 Graduate Credit
Graduate students have to do the extra credit.

7 Deliverables
Your entire project should be in a folder called "firstnamelastname". This folder should be zipped up and submitted through e-campus. Inside the folder, you should have the following:
• A folder named "Code" containing all the code for this assignment. Please include a README file to explain what each file does if you add any other files to the starter code.
• A report in PDF format. Make sure you write your name at the top of the report. Also make sure the PDF file is under 5 MB.
Make sure you exclude all the results and original images from your submission.

8 Checklist
Make sure you can check all the items below before submitting your assignment. You will lose 5 points for each item that cannot be checked.
The folder is named properly ("firstnamelastname"). Pay attention to what goes between your first and last name. The folder structure should be exactly as follows:
"firstname lastname.zip"
– "firstname lastname"
  ∗ "Code"
  ∗ Report.pdf
Inside the root folder, there is a folder called "Code" that contains your source code. Also make sure the report is in the root folder.
The folders "Images" and "Results" are not included (you only submit your code and a report).
Name written at the top of the report.
The report is in PDF format and the file is under 5 MB.

9 Rubric
Total credit: [100 points]
[55 points] – Implement Poisson blending
[15 points] – Implement the approach efficiently using a sparse matrix
[10 points] – Handle the boundary conditions
[10 points] – Include one example of your own
[10 points] – Write up
Extra credit: [15 points]
[15 points] – Mixing gradients

10 Acknowledgements
The project is derived from James Hays' Computational Photography course with permission.
Programming Assignment 3

1 Goal
The goal of the first part of the project is to create hybrid images using the approach described in the SIGGRAPH 2006 paper by Oliva, Torralba, and Schyns. Hybrid images are static images that change in interpretation as a function of the viewing distance. The basic idea is that high frequency tends to dominate perception when it is available, but, at a distance, only the low frequency (smooth) part of the signal can be seen. By blending the high frequency portion of one image with the low-frequency portion of another, you get a hybrid image that leads to different interpretations at different distances.
The goal of the second part of this project is to perform image blending using pyramids. Basically, given a source image, a target image, and a mask, our goal is to blend the masked areas of the source image into a target photograph in a seamless manner.

2 Starter Code
Starter code (in Python) along with the images can be downloaded from here. Note that the expected result for hybrid images is included in the "Results" folder.

3 Task 1
The starter code takes the correspondences between the two input images to align them. The alignment is important because it affects the perceptual grouping (read the paper for details). First, you'll need to get a few pairs of images that you want to make into hybrid images. You can use the sample images for debugging, but you should use your own images in your results. Then, you will need to write code to low-pass filter one image and high-pass filter the second image. For a low-pass filter, Oliva et al. suggest using a standard 2D Gaussian filter. For a high-pass filter, they suggest using the impulse filter minus the Gaussian filter (which can be computed by subtracting the Gaussian-filtered image from the original). The standard deviation of each filter should be chosen with some experimentation. Note that you need to write your own code to generate the Gaussian kernel by approximating the continuous Gaussian function as done in the class. Once you create your kernel, filtering can be done by scipy.signal.convolve2d. Show your results on two examples in addition to the one provided.

3.1 Extra Credit
Try using color to enhance the effect. Does it work better to use color for the high-frequency component, the low-frequency component, or both? Discuss it in the report.

4 Task 2
Here, you implement the pyramid blending approach that was discussed in the class. Given a source, a target, and a mask, you have to follow these steps to generate the blended image:
• Compute Laplacian pyramids LS and LT from the source and target images.
• Compute a Gaussian pyramid GM from the mask.
• Use the Gaussian pyramid to combine the source and target Laplacian pyramids as follows:
  LC(l) = GM(l) × LS(l) + (1 − GM(l)) × LT(l),   (1)
  where l is the layer index.
• Collapse the blended pyramid LC to reconstruct the final blended image.
You can test your implementation on image 1. You should also create one example of your own where pyramid blending produces a nice blended image. Additionally show an example where the approach fails (explain why it fails). You can use the GetMask function to obtain a mask on your images. You should use skimage.transform.resize to build the Gaussian and Laplacian pyramids. Note that skimage.transform.resize already performs the filtering and subsampling. So you don't need to worry about aliasing. Just use this function to implement the Reduce and Expand operations.
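As a rough illustration of the steps listed above (a sketch only, not the starter code's required structure; the function names, the fixed number of levels, and the assumption of single-channel float images in [0, 1] are ours):

import numpy as np
from skimage.transform import resize

def gaussian_pyramid(img, levels):
    """Reduce: each level is half the size of the previous one."""
    pyr = [img]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        pyr.append(resize(pyr[-1], (h // 2, w // 2), anti_aliasing=True))
    return pyr

def laplacian_pyramid(img, levels):
    """Each level stores the detail lost when going down one level."""
    gauss = gaussian_pyramid(img, levels)
    lap = []
    for l in range(levels - 1):
        up = resize(gauss[l + 1], gauss[l].shape)   # Expand
        lap.append(gauss[l] - up)
    lap.append(gauss[-1])                            # coarsest level kept as-is
    return lap

def pyramid_blend(source, target, mask, levels=5):
    LS = laplacian_pyramid(source, levels)
    LT = laplacian_pyramid(target, levels)
    GM = gaussian_pyramid(mask.astype(float), levels)
    LC = [gm * ls + (1 - gm) * lt for gm, ls, lt in zip(GM, LS, LT)]
    # Collapse: start at the coarsest level, repeatedly Expand and add.
    out = LC[-1]
    for l in range(levels - 2, -1, -1):
        out = resize(out, LC[l].shape) + LC[l]
    return np.clip(out, 0, 1)

For color images you would apply the same blend per channel (or broadcast the mask pyramid across channels); the structure of the computation does not change.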
5 Write Up
Include all the results in your report. For hybrid images, show the inputs as well as the hybrid image. For pyramid blending, you should show the input images, as well as the naive blending along with your blended result. Describe how you implemented the assignment. Discuss any problems you faced when implementing the assignment or any decisions you had to make.

6 Graduate Credit
Graduate students have to do the extra credit.

7 Deliverables
Your entire project should be in a folder called "firstnamelastname". This folder should be zipped up and submitted through e-campus. Inside the folder, you should have the following:
• A folder named "Code" containing all the code for this assignment. Please include a README file to explain what each file does if you add any other files to the starter code.
• A report in PDF format. Make sure you write your name at the top of the report. Also make sure the PDF file is under 5 MB.
Make sure you exclude all the results and original images from your submission.

8 Checklist
Make sure you can check all the items below before submitting your assignment. You will lose 5 points for each item that cannot be checked.
The folder is named properly ("firstnamelastname"). Pay attention to what goes between your first and last name. The folder structure should be exactly as follows:
"firstnamelastname.zip"
– "firstnamelastname"
  ∗ "Code"
  ∗ Report.pdf
Inside the root folder, there is a folder called "Code" that contains your source code. Also make sure the report is in the root folder.
The folders "Images" and "Results" are not included (you only submit your code and a report).
Name written at the top of the report.
The report is in PDF format and the file is under 10 MB.
All the results are included in the report.
Original images or results are not included in the package.

9 Rubric
Total credit: [100 points]
[30 points] – Implement hybrid images
[10 points] – Include two examples of your own
[40 points] – Implement pyramid blending
[10 points] – Include two examples of your own (success and failure)
[10 points] – Write up
Extra credit: [5 points]
[5 points] – Implement hybrid images in color

10 Acknowledgements
The project is derived from Alyosha Efros' Computational Photography course with permission.
Assignment 2

1 Background
Sergei Mikhailovich Prokudin-Gorskii (1863-1944) was a man well ahead of his time. Convinced, as early as 1907, that color photography was the wave of the future, he won the Tsar's special permission to travel across the vast Russian Empire and take color photographs of everything he saw, including the only color portrait of Leo Tolstoy. And he really photographed everything: people, buildings, landscapes, railroads, bridges… thousands of color pictures! His idea was simple: record three exposures of every scene onto a glass plate using a red, a green, and a blue filter. Never mind that there was no way to print color photographs until much later – he envisioned special projectors to be installed in "multimedia" classrooms all across Russia where the children would be able to learn about their vast country. Alas, his plans never materialized: he left Russia in 1918, right after the revolution, never to return again. Luckily, his RGB glass plate negatives, capturing the last years of the Russian Empire, survived and were purchased in 1948 by the Library of Congress. The LoC has recently digitized the negatives and made them available on-line.

2 Goal
The goal of this assignment is to take the digitized Prokudin-Gorskii glass plate images and, using image processing techniques, automatically produce a color image with as few visual artifacts as possible. This should be done by extracting the three color channel images, placing them on top of each other, and aligning them so that they form a single RGB color image. The starter code, provided below, performs all the operations with the exception of alignment. Your job is to perform alignment by implementing the FindShift function. We will assume that a simple (x,y) translation model is sufficient for proper alignment.

3 Starter Code
Starter code (in Python) along with the images can be downloaded from here. Note that expected results for a few images are included in the "Results" folder.

4 Task 1
The digitized glass plate images (an example shown on the right) are in BGR order from top to bottom. Your program takes a glass plate image as input and produces a single color image as output. The program should divide the image into three equal parts and align the second and the third parts (G and R) to the first (B). For each image, you will need to print the (x,y) translation that was used to align the parts.
The easiest way to align the parts is to exhaustively search over a window of possible displacements (say [-20, 20] pixels), score each one using some image matching metric, and take the displacement with the best score. There are a number of possible metrics one could use to score how well the images match. An example of such a metric is the sum of squared differences (SSD), defined as:

SSD(I_1, I_2) = Σ_{i=1}^{W} Σ_{j=1}^{H} (I_1(i, j) − I_2(i, j))^2   (1)

To do this, you should write a for loop to search over a user-specified window of displacements. For each displacement, compute the score and save it into an array. At the end, you should choose the displacement with the best score (minimum SSD). You should test your code on the "jpg" images. In the next task, you will be implementing a multi-scale algorithm to be able to handle "tif" images with extremely high resolution.
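A minimal sketch of the exhaustive single-scale search described above (the function name and the window and border parameters are our choices, not part of the starter code; the starter code's circshift function plays the same role as np.roll here, and only interior pixels are scored, as suggested in the hints below):

import numpy as np

def find_shift_single_scale(moving, fixed, window=20, border=30):
    """Return the (dx, dy) shift of `moving` that best aligns it to `fixed` by SSD."""
    best_score, best_shift = np.inf, (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            shifted = np.roll(moving, (dy, dx), axis=(0, 1))
            # Compare interior pixels only, to avoid the invalid wrapped border.
            a = shifted[border:-border, border:-border]
            b = fixed[border:-border, border:-border]
            score = np.sum((a - b) ** 2)
            if score < best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift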
5 Task 2
Exhaustive search will become prohibitively expensive if the pixel displacement is too large (which will be the case for high-resolution glass plate scans). In this case, you will need to implement a faster search procedure such as an image pyramid. An image pyramid represents the image at multiple scales (usually scaled by a factor of 2), and the processing is done sequentially, starting from the coarsest scale (smallest image) and going down the pyramid, updating your estimate as you go. Specifically, you need to create an image pyramid by resizing your image multiple times (you can use skimage.transform.resize for resizing). For example, if you choose a pyramid with 4 levels and your image has a resolution of 1024×1024, you will have four images with resolutions of 1024×1024, 512×512, 256×256, and 128×128. You start by finding the shift in the smallest image. Since the image is small at this scale, a small search range ([-20, 20]) would be sufficient. Once the translation (x,y) at this scale is found, you properly scale this translation to the finer scale and use it as the initial translation vector. At this finer scale, you will only search around this initial translation vector. You continue this process until reaching the finest scale. (A sketch of this coarse-to-fine procedure is given after the hints below.)

6 Hints
• You need to implement almost everything from scratch (except functions for reading, writing, resizing, shifting, and displaying images; some of these, like reading and writing, are already done in the starter code). In particular, you are not allowed to use high-level functions for constructing pyramids, computing matching scores, etc.
• The average running time is expected to be less than 1 minute per image (ideally a few seconds). If it takes hours for your program to finish, you should further optimize the code.
• Avoid too many for loops by vectorizing/parallelizing your code (see more details here and here).
• Do not get bogged down tweaking input parameters. Most, but not all, images will line up using the same parameters. Your final results should be the product of a fixed set of parameters (if you have free parameters). Do not worry if one or two of the handout images do not align properly. In particular, the emir image is difficult to align with the basic SSD metric. For that, better features (see the extra credit section) are needed.
• Shifting a matrix can be done using the circshift function, which is provided in the starter code.
• Compute your metric on the internal pixels only, to avoid using the invalid border pixels.
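One possible shape for the coarse-to-fine procedure from Task 2, written recursively (a sketch under assumptions: the names, the hard-coded scoring border, and the stopping rule are ours, and skimage.transform.resize is used only to build the smaller scales):

import numpy as np
from skimage.transform import resize

def _ssd_search(moving, fixed, window):
    """Brute-force SSD search over [-window, window] in x and y."""
    best, arg = np.inf, (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            s = np.roll(moving, (dy, dx), axis=(0, 1))
            score = np.sum((s[20:-20, 20:-20] - fixed[20:-20, 20:-20]) ** 2)
            if score < best:
                best, arg = score, (dx, dy)
    return arg

def find_shift_pyramid(moving, fixed, levels=4, window=20):
    """Coarse-to-fine (x, y) translation search. Sketch only."""
    if levels == 1 or min(moving.shape) < 200:
        return _ssd_search(moving, fixed, window)
    # Recurse on half-resolution copies to get a coarse estimate.
    small_m = resize(moving, (moving.shape[0] // 2, moving.shape[1] // 2))
    small_f = resize(fixed, (fixed.shape[0] // 2, fixed.shape[1] // 2))
    dx, dy = find_shift_pyramid(small_m, small_f, levels - 1, window)
    dx, dy = 2 * dx, 2 * dy                       # scale the coarse estimate up
    # Refine with a small search around the scaled estimate.
    shifted = np.roll(moving, (dy, dx), axis=(0, 1))
    rdx, rdy = _ssd_search(shifted, fixed, window=2)
    return dx + rdx, dy + rdy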
7 Extra Credit
• Automatic cropping: Remove white, black or other color borders. Don't just crop a predefined margin off of each side – actually try to detect the borders or the edge between the border and the image.
• Automatic white balance: This involves two problems – 1) estimating the illuminant and 2) manipulating the colors to counteract the illuminant and simulate a neutral illuminant. Step 1 is difficult in general, while step 2 is simple (see the Wikipedia page on Color Balance and section 2.3.2 in the Szeliski book). There exist some simple algorithms for step 1, which don't necessarily work well – assume that the average color or the brightest color is the illuminant and shift those to gray or white.
• Better features: Instead of aligning based on RGB similarity, try using edges. This will particularly help with "emir.tif".
• Better transformations: Instead of searching for the best x and y translation, additionally search over small scale changes and rotations. Adding two more dimensions to your search will slow things down, but the same coarse-to-fine progression should help alleviate this.

8 Write up
Include all the images with their corresponding (x,y) shifts in your report. Make sure you only include the jpg images for task 1 (single-scale) and tif images for task 2 (multi-scale). Describe the method and all the parameters used to obtain the results. Also discuss any extra credit you did.

9 Graduate Credit
Graduate students have to do at least 10 points worth of extra credit.

10 Deliverables
Your entire project should be in a folder called "firstnamelastname". This folder should be zipped up and submitted through Canvas. Inside the folder, you should have the following:
• A folder named "Code" containing all the code for this assignment. Please include a README file to explain what each file does if you add any other files to the starter code.
• A report in PDF format. Make sure you write your name at the top of the report.
Make sure you exclude all the results and original images from your submission.

11 Checklist
Make sure you can check all the items below before submitting your assignment. You will lose 5 points for each item that cannot be checked.
The folder is named properly ("firstnamelastname"). The folder structure should be exactly as follows:
"firstnamelastname.zip"
– "firstnamelastname"
  ∗ "Code"
  ∗ Report.pdf
Inside the root folder, there is a folder called "Code" that contains your source code. Also make sure the report is in the root folder.
The folders "Images" and "Results" are not included (you only submit your code and a report).
Name written at the top of the report.
The report is in PDF format. The file size should be under 10 MB.
All the results are included in the report.
Original images or results are not included in the package.

12 Rubric
Total credit: [100 points]
[40 points] – Implement the single-scale approach
[50 points] – Implement the multi-scale approach
[10 points] – Write up
Extra credit: [up to 10 points]
[04 points] – Automatic cropping
[03 points] – Automatic contrasting
[04 points] – Automatic white balance
[03 points] – Better features
[05 points] – Better transformations

13 Acknowledgments
This project is derived from Alexei A. Efros' Computational Photography course with permission.
Projects: Man in the Middle

Flag 1 (5 points)
Your first task is to figure out where the hackers are spending their time and gather some evidence for the Attorney General. This will also give you a good overview of Wireshark filters. The Attorney General needs some evidence of The Necrocryptors' associates and where the group meets. For this, you need to gather the following information:

Task 1.1 Based on the provided packet capture (pcap) file, identify the server address used by the hackers to communicate.
Example: irc.someplace.net
Points: 1

Task 1.2 Based on the provided packet capture (pcap) file, identify the nicknames of the malicious actors involved in the conversation. List the nicknames in the order they appear in the conversation, following the format below:
Example: firstactor,secondactor,thirdactor
Points: 1

Task 1.3 Based on the provided packet capture (pcap) file, identify the channel the malicious actors use to communicate. Remember, channel names always start with #, so include # in your answer.
Example: #WOW
Points: 1

Task 1.4 Based on the provided packet capture (pcap) file, identify the hash used by the malicious actor to validate its identity.
Example: a12342342bcde393202013434
Points: 1

Task 1.5 Based on the pcap file provided, analyze the network traffic to determine the potential origin country of the last identified malicious actor. Consider the IP addresses and any geolocation data. Provide the name of the country.
Example: Atlantis
Points: 1

Flag 2 (27 points)
Your second task will require you to recover a payload from the conversation. There are multiple ways to do this. You can use Wireshark, pyShark or any other library available. As part of the evidence gathering, the Attorney General needs concrete evidence of malicious intent. For Task 2, you will need to review the conversation between members of TNC and gather incriminating data from this conversation.

Task 2.1 Based on the provided pcap file, identify which malicious actor initiated a private chat during the conversation.
Example: maliciousactor
Points: 2

Task 2.2 Based on the provided pcap file, identify the name of the file transferred by one hacker to another via IRC DCC (including extension).
Example: somefile.extension
Points: 5

Task 2.3 Based on the provided pcap file, determine the encryption method or algorithm used to encrypt the file transferred between the hackers (just the 3-letter name).
Example: something
Points: 4

Task 2.4 If you decrypt and run the file, you'll get a unique hash based on your GTID. What is the hash generated?
Example: a123242342342342342934234
Points: 16

Flag 3 (21 points)
The Attorney General lets you know that they think there is a web server in here that is phishy and is spitting out long numbers and letters. The Necrocryptors hacking group is known to play tricks with these values. The Attorney General needs the following information to track the folks operating the website:
Task 3.1 The site domain name. (Record just the site's domain name and the top-level-domain (TLD) name, with the period, e.g. something.hostname.tld)
Example: something.something.something
Points: 2

Task 3.2 What is the public IP address?
Example: 192.168.1.10
Points: 2

Task 3.3 The primary nameserver for this domain. (You may need to look outside the pcap for this information. Think about tools that will give you the nameserver data for a specific domain.)
Example: ns-something-something.something.something
Points: 6

Task 3.4 The hash provided by entering your Georgia Tech ID in the field (i.e. 9021042). (NOTE: The website is real and safe to access.)
Example: abcdef1234567890953453434
Points: 11

Flag 4 (27 points)
The Attorney General is impressed by you but says they believe the group is also using another server to host a malicious file. It appears that one of the hackers recently accessed this server and downloaded a file from it. As a last-minute request, the Attorney General is asking you to investigate what this file is and where it is hosted.

Task 4.1 What is the IP address for the server in question?
Example: 192.168.8.7
Points: 2

Task 4.2 What is the username used to log in to the server?
Example: something
Points: 4

Task 4.3 What is the password used to log in to the server?
Example: something
Points: 4

Task 4.4 One file is downloaded from the server; what is the file name?
Example: something
Points: 3

Task 4.5 What is the programming language used to create this file?
Example: something
Points: 5

Task 4.6 If you run this file you'll get a combined hash. What is the unique hash for your GTID (i.e. 902042)?
Example: 12123123129413249121249aa
Points: 9

Flag 5 (5 points)
Exhausted from the prior exercises, the Attorney General has two more exercises for you to prove you belong here and that he shouldn't fire you, despite your doing a good job. He mentions that the hackers are getting smart and that they have a website called http://www.didbastionbreak.com:5000 that has absolutely nothing to do with Azure Firewalls but everything to do with web application firewalls. Apparently there are some weaknesses integrated into the website which allow you to get to different parts of the site, using something called a path traversal attack.

Task 5.1 There is a flag labeled 5.1 that outputs a hash when you input your GTID. Try to find the page and recover the flag.
Example: tr95843fkdspugr8euyre0gfd
Points: 2

Task 5.2 From the main page on the website, click the blue box that says "Download the Zip". When you do, it downloads a file that is zipped and encrypted with a password. You have to use the tool "John the Ripper" to crack the encryption and find the password. What is the password for your file? Hint: The password is seven numbers long.
Points: 1

Task 5.3 When you use the password to unlock the file and unzip it, it contains a program. After you run the program, what is the hash provided?
Example: 58437594ejgfdiohr8e054309
Points: 2

Suddenly, your phone rings. You see that the call is coming from Bill's extension. You were ready to head back home and watch Netflix. Here we go again…
"Mark, great job so far! I was thinking here. This will not be the last time you will be doing this analysis on pcaps, so why don't we start building a Python class with several methods to automate some of the work for next time?"
"When you say we, you are saying, why don't I build this class, right?" you say.
"Of course not! I already created some skeleton code to help you out. You just need to build 3 functions now," Bill says.
"Oh, ok. Thank you, Boss…"
As you hang up the call, Bill sends you via IM a zip file containing the Python class and an attack pcap from a past incident so you can create the functions and test them.

Flag 6 (15 points)
For this task, you need to use the provided pcapanalysis.py and Flag6.pcap files to create three functions. The snippet below shows where you need to code the functions and the expected output for each returned variable. You can create as many functions and variables as you need; however, the provided functions need to return the expected output.

Function Skeleton

# TODO:
# Task 1: Return n, being:
# n = Number of ICMP Packets
def icmp_count(self):
    n = 0
    # TODO: Implement me
    return n

# TODO:
# Task 2: Return r, a, being:
# r = Number of ICMP Echo Requests
# a = Number of ICMP Echo Replies
def icmp_request_reply(self):
    r = 0
    a = 0
    # TODO: Implement me
    return r, a

# TODO:
# Task 3: Return m, n, being:
# m = Most Common Destination MAC Address
# n = Number of Occurrences
def dest_mac(self):
    m, n = 0, 0
    # TODO: Implement me
    return m, n

if __name__ == '__main__':
    pcap_analysis = MITMProject()
    icmp_count = pcap_analysis.icmp_count()
    request, reply = pcap_analysis.icmp_request_reply()
    dest_mac, occurrences = pcap_analysis.dest_mac()
    print("Number of ICMP Packets : ", icmp_count)
    print("Number of ICMP Requests and Replies : ", request, reply)
    print("Most Common MAC Address and Number of Occurrences: ", dest_mac, occurrences)

To start, make sure that the package pyshark is installed on your system. Please review the pyshark GitHub page to install the package and its dependency (tshark): https://github.com/KimiNewt/pyshark/ and https://tshark.dev/setup/install/

When you open pcapanalysis.py, make sure student_id is updated with your 9-digit Georgia Tech ID:

# TODO: Change this to YOUR Georgia Tech ID!!!
# This is your 9-digit Georgia Tech ID
self.student_id = '900000000'

Do not modify the import statements. All you need to complete this assignment is already there. New imports may be ignored by the autograder and your code will fail.

Deliverables:
Task 6.1 Modify the def icmp_count(self): function so that it returns an integer, n, which represents the number of ICMP packets in the flag6.pcap file.
Points: 3
Task 6.2 Modify the def icmp_request_reply(self): function to return r (the number of ICMP Echo Requests as an integer) and a (the number of ICMP Echo Replies as an integer).
Points: 5
Task 6.3 Modify the def dest_mac(self): function to return m (the most common destination MAC address as a string) and n (its number of occurrences as an integer).
Points: 7
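As a hedged illustration only (not the official solution, and ignoring the class structure of the starter code), Task 6.1 can be approached by letting tshark's display filter do the work; the filter and field names below are standard Wireshark ones:

import pyshark

def icmp_count(pcap_path='Flag6.pcap'):
    """Count ICMP packets by letting tshark filter the capture."""
    capture = pyshark.FileCapture(pcap_path, display_filter='icmp')
    n = sum(1 for _ in capture)
    capture.close()
    return n

The same pattern extends to Tasks 6.2 and 6.3: for example, display filters such as 'icmp.type == 8' and 'icmp.type == 0' separate echo requests from replies, and each packet's eth.dst field can be tallied with collections.Counter to find the most common destination MAC address.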
Prelab: pthreads – user-level threads

Description
You are going to demonstrate your understanding of the semantics of an existing threads library (pthreads, to be exact) and how to use it. Under this directory you will find a source (.c) file and a Makefile. (Although we're not asking you to modify the Makefile now, this would be a good opportunity to take a look and make sure you understand how it works, because you may need to modify Makefiles for later projects.) You can compile the source file with the command make and clean up the compiler output with make clean. The program should compile and run correctly on any Red Hat Enterprise Linux (RHEL) system or Ubuntu 18+.
The source code provided is a multi-threaded producer-consumer program. That means producer threads add items (in this case, a character) to a queue, and consumer threads remove the items from the queue and do something with them (in this case, print the character to standard output). The main thread spawns a consumer thread and 2 producer threads, then the producer threads also spawn a consumer thread. So in total there are 2 producer threads and 3 consumer threads (plus the main thread). The program should have the consumers collectively print every letter of the alphabet exactly once for each producer, without crashing.
You will find that it compiles correctly, but when you run the program there are several bugs. For this pre-lab, you will be asked to do two things:
First, find and fix the five (5) bugs that have been placed in the source code. (Some of these bugs will appear at runtime, but others may only be apparent by code inspection.) Putting comments near the bugs you found will also help us to grade your pre-lab.
Second, answer the questions below (short answers, one or two sentences each). It may be convenient to make a copy of the original source file before you modify it, since the questions below include line numbers.
To find information about pthread functions, you may use the command man [func name] (e.g. man pthread_create). Please note that all the bugs in the program pertain to the pthreads threading and mutual exclusion libraries in some way. There are not bugs pertaining to program semantics (except for the semantics of pthreads and mutexes), and the intended bugs are all things that are definitely wrong regardless of any common sense interpretation of the program's intended behavior. Also, all the string constants should be correct. If there is a mistake in a string constant, then that's my mistake and not one of the intended bugs. (You can still point it out, though, so we can fix it for next time.) These guidelines are meant to help, but if you find something that you're unsure about, you can ask the TA.

Collaboration
Students are not allowed to collaborate on this assignment. Each student is expected to find the 5 bugs, fix them, and answer the questions on their own. (You may use reference material, however, to help you understand how pthreads work and the semantics of the function calls.)

Questions
The main function contains calls to exit() (line 66) and pthread_exit() (line 80). How will the effect of these two calls differ when they are executed?
The main function calls pthread_join() (line 90) with the parameter thread_return. Where does the value stored in thread_return come from when the consumer_thread is joined? Where does the value stored in thread_return come from if the joined thread terminated by calling pthread_exit instead of finishing normally?
On the same call to pthread_join() (line 90), what will it do if the thread being joined (consumer_thread, in this case) finishes before the main thread reaches that line of code (line 90)?
In this program, the main thread calls pthread_join() on the threads it created. Could a different thread call pthread_join() on those threads instead? Could a thread call pthread_join() on the main thread (assuming it knew the main thread's thread ID – i.e. pthread_t)?
The consumer_routine function calls sched_yield() (line 194) when there are no items in the queue. Why does it call sched_yield() instead of just continuing to check the queue for an item until one arrives?

Deliverables
Submit the following on Gradescope:
A PDF file containing the answers to the above questions.
Your modified source file (producer_consumer.c with the 5 bugs fixed).
Upload both files on Gradescope in the PreLab Assignment. On submission, you must be able to see the autograder output. This is the output we will use to verify that the modified code runs without any errors (segmentation faults, etc.).
1. Suppose we are tagging a sentence "They run programs" with two types of tags: N (noun) and V (verb). The initial probabilities, transition probabilities and emission probabilities computed by an HMM model are shown below.

      N     V
π    0.8   0.2
Table 1: Initial probabilities

      N     V
N    0.4   0.6
V    0.8   0.2
Table 2: Transition probabilities

      they   run   programs
N     0.6    0.2   0.2
V     0      0.6   0.4
Table 3: Emission probabilities

(a) Given the HMM model above, compute the probability of the given sentence "They run programs" by using both the Forward and the Backward Algorithm. Please show all the steps of the calculations. [6 pts]
(b) Tag the sentence "They run programs" by using the Viterbi Algorithm. Please show all the steps of the calculations. [4 pts]

Solution. Solution goes here.

2. Suppose we are parsing a sentence "submit the homework through Canvas" with a Probabilistic Context-Free Grammar (PCFG) in Chomsky Normal Form. We give the PCFG as follows.

Rule                       Probability
S → NP VP                  0.8
S → submit                 0.01
S → Verb NP                0.05
S → VP PP                  0.03
NP → Canvas                0.16
NP → Gradescope            0.04
NP → Det Nominal           0.6
Nominal → Nominal Noun     0.2
Nominal → Nominal PP       0.5
Nominal → homework         0.15
VP → submit                0.1
VP → contain               0.04
VP → Verb NP               0.5
VP → VP PP                 0.3
PP → Prep NP               1.0
Det → the                  0.6
Prep → through             0.2
Verb → submit              0.5
Table 4: PCFG rules and probabilities

(a) Parse the sentence "submit the homework through Canvas" by using the CKY Algorithm. You need to show the probabilities of all possible parses and choose the parse with the highest probability. Please show all the steps of the calculations. [8 pts]
(b) Compute the probability of the given sentence "submit the homework through Canvas" by using the CKY Algorithm. Please show all the steps of the calculations. [2 pts]

Solution. Solution goes here.
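If you would like to sanity-check your hand calculations for problems 1(a) and 1(b) before moving on, here is a small NumPy sketch of the forward and Viterbi algorithms using the tables above (the variable names are ours; it assumes the rows of Table 2 index the previous tag and the columns the next tag):

import numpy as np

# Tables 1-3 above: states [N, V], observations [they, run, programs]
pi = np.array([0.8, 0.2])
A = np.array([[0.4, 0.6],       # transition: row = previous tag, column = next tag
              [0.8, 0.2]])
B = np.array([[0.6, 0.2, 0.2],  # emission: row = tag, column = word
              [0.0, 0.6, 0.4]])
obs = [0, 1, 2]                 # "they run programs"

def forward(pi, A, B, obs):
    """P(sentence) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi(pi, A, B, obs):
    """Most likely tag sequence (0 = N, 1 = V)."""
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * A * B[:, o]   # scores[i, j]: come from tag i, go to tag j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

print(forward(pi, A, B, obs), viterbi(pi, A, B, obs))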
3. For the programming task, you will implement several models for part-of-speech tagging. We will use the LSTM as the basic architecture for the model and try several modifications to improve performance. To do this, you will complete the Python code started in POS tagging.ipynb, available on Canvas along with the train and test data (train.txt and test.txt) in POS tagging.zip. For all the written questions of Q3 (accuracy reporting and error analysis), please provide your answers in the space provided inside the notebook.
NOTE: for each model that you test, you will produce a file test labels.txt that you can upload to Gradescope to evaluate your model's performance.

(a) To begin, implement an LSTM tagger by completing the skeleton code provided in BasicPOSTagger in the notebook. The model will use word embeddings to represent the sequence of words in the sentence. For this problem, implement the forward pass and the training procedure for the tagger. After training for 30 epochs, the model should achieve at least 80% on the held-out data.
In addition, compute the top-10 most frequent types of errors that the model made in labeling the data in test.txt (e.g. if the model labeled NN as VB 10 times) and report these errors in the written answer. Report the results in a table containing the top-10 mistakes, with the following columns: model tag type, ground truth tag type, error frequency, and up to 5 example words that were tagged incorrectly. What kinds of errors did the model make and why do you think it made them? [5 pts]

(b) Word-level information is useful for part-of-speech tagging, but what about character-level information? For instance, the present tense of English verbs is marked with the suffix -ing, which can be captured by a sequential character model.
For this problem, implement the character-level model by completing the skeleton code provided in CharPOSTagger in the notebook. Unlike part (a), this model will use character embeddings. After training for 30 epochs, the model should achieve at least 80% on the held-out data.
In addition, compare the top-10 types of errors made by the character-level model with the errors made by the word-level model from part (a) and report these errors in the written answer. Report the error results in the same format as before. What kinds of errors does the character-level model make as compared to the original model, and why do you think it made them? [10 pts]

(c) The previous models considered only a single direction, i.e. feeding the information from left to right (starting at word i = 1 and ending at word i = N). For the final model, you will implement a bidirectional LSTM that uses information from both left-to-right and right-to-left to represent dependencies between adjacent words in both directions. In order to decide if the word "call" is VB or NN, the model can use information from the following words to disambiguate (e.g. "call waiting" vs. "call him").
For this problem, implement the bidirectional model by completing the skeleton code provided in BiLSTMPOSTagger in the notebook. Like the model in part (a), this will use word embeddings. After training for 30 epochs, the model should achieve at least 90% on the held-out data.
In addition, implement one of the three modifications listed in the notebook to boost your score: (a) different word embeddings to initialize the model (one different type), (b) different hyperparameters to initialize the model (five different combinations), (c) a different model architecture on top of the BiLSTM (one different architecture). Explain in the written answer which modifications you have made, why you chose a certain modification over another (e.g. GloVe over word2vec), and the resulting accuracy of the modified model.
Lastly, compare the top-10 errors made by this modified model with the errors made by the model from part (a). Report the error results in the same format as before. If you tested multiple different hyperparameters, compute the errors for the model with the highest accuracy on the validation data. What errors does the original model make as compared to the modified model, and why do you think it made them? [10 pts]
1. The Kneser-Ney smoothing method approximates the probability of an n-gram that has not been seen using the likelihood of the (n-1)-gram to occur in diverse contexts. If we want to finish the sentence "I want to go to the movie ___" with the word "theatre" but we have not observed "theatre" very often, then we want to make sure that we can guess "theatre" based on its likelihood of combining with other words.
The full formula for the bigram version of Kneser-Ney smoothing follows:

P(bigram) = discounted bigram probability + weighted unigram continuation probability   (1)

P(w_i | w_{i-1}) = max(C(w_{i-1}, w_i) − d, 0) / C(w_{i-1})
                 + λ(w_{i-1}) · |{v : C(v, w_i) > 0}| / Σ_{w'} |{v : C(v, w') > 0}|   (2)

λ(w_{i-1}) = ( d / Σ_v C(w_{i-1}, v) ) · |{w : C(w_{i-1}, w) > 0}|   (3)

Assume that you have collected the data in the following tables, and assume that all other observed counts are 0. In the bigram table, rows represent w_{i-1} and columns represent w_i: e.g. C(computer, keyboard) = 2.

C(w_{i-1}, w_i)   computer  keyboard  monitor  store
computer             0         2         4       4
keyboard             1         0         0       1
monitor              0         1         1       1
store                2         0         0       0
Table 1: Bigram frequency. Rows = w_{i-1}, columns = w_i.

computer   10
keyboard    3
monitor     6
store       5
Table 2: Unigram frequency.

Consider the following sentence fragment S: "I shopped at the computer ___". You need to determine whether the sentence is more likely to end with "computer store" or "computer monitor".
(a) Compute the raw bigram probabilities for the candidate words {store, monitor} to complete the sentence S, i.e. P(store|computer) and P(monitor|computer). Is one word more likely than the other, and if so which one? [2 pts]
(b) Compute the Kneser-Ney smoothed bigram probability of the candidate words {store, monitor} to complete the sentence. Use d = 0.5 as the discount term. Is one word more likely than the other, and if so which one? If the result has changed, why do you think it changed? [5 pts]
(c) Change the discount term to d = 0.1 and re-compute the Kneser-Ney smoothed bigram probability of the candidate words {store, monitor} to complete the sentence. Is one word more likely than the other, and if so which one? If the result has changed, why do you think it changed? [3 pts]

Solution. Solution goes here.

2. Suppose we have a term-document matrix for four words in three documents, shown in Table 3. The whole document set has N = 20 documents, and for each of the four words, the document frequency df_t is shown in Table 4.

term-document   Doc1  Doc2  Doc3
car              27     4    24
insurance         3    18     0
auto              0    33    29
best             14     0    17
Table 3: Term-document matrix

car         12
insurance    6
auto        10
best        16
Table 4: Document frequency

(a) Compute the tf-idf weights for each word car, auto, insurance and best in Doc1, Doc2, and Doc3. [6 pts]
(b) Use the tf-idf weights you get from (a) to represent each document with a vector and calculate the cosine similarities between these three documents. [4 pts]

Solution. Solution goes here.
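For problem 2, a small NumPy sketch can double-check the arithmetic. It assumes the common tf-idf variant w = tf × log10(N/df); if your class uses a different weighting (e.g. 1 + log10 tf), adjust accordingly:

import numpy as np

N = 20                                    # total number of documents
tf = np.array([[27,  4, 24],              # rows: car, insurance, auto, best
               [ 3, 18,  0],
               [ 0, 33, 29],
               [14,  0, 17]], dtype=float)
df = np.array([12, 6, 10, 16], dtype=float)

idf = np.log10(N / df)                    # one idf value per term
W = tf * idf[:, None]                     # tf-idf weight for each term/document pair

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

docs = W.T                                # one row per document vector
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"cos(Doc{i+1}, Doc{j+1}) = {cosine(docs[i], docs[j]):.3f}")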
3. The distributional hypothesis suggests that the more similarity there is in the meaning of two words, the more distributionally similar they are, where a word's distribution refers to the context in which it appears. This motivated the work by Mikolov et al. on the skip-gram model, which is an efficient way of learning high-quality dense vector representations of words from unstructured text. The objective of the skip-gram model is to learn the probability distribution P(O|I): given an inside word w_I, we intend to estimate the probability that an outside word w_O lies in the context window of w_I. The basic formulation of the skip-gram model defines this using the softmax function:

P(O = w_O | I = w_I) = exp(u_{w_O}^T v_{w_I}) / Σ_{w ∈ Vocab} exp(u_w^T v_{w_I})   (4)

Here, u_{w_O} is the word vector representing the outside word w_O, and v_{w_I} is the word vector representing the inside word w_I. To update these parameters continually during training, we store them in two matrices U and V. The columns of V are all of the inside word vectors v_{w_I}, while the columns of U are all the outside word vectors u_{w_O}, and both these matrices contain a vector for each word in the vocabulary.

(a) The cross entropy loss between two probability distributions p and q is expressed as:

CE(p, q) = − Σ_i p_i log(q_i)   (5)

For a given inside word w_I = w_k, consider the ground truth distribution y to be a one-hot vector (with length equal to the vocabulary size) with a 1 only for the true outside word w_O and 0 everywhere else. The predicted distribution ŷ (of the same length) is the probability distribution P(w_O | w_I = w_k). The i-th entry in these vectors is the probability of the i-th word being an outside word. Write down and simplify the expression for the cross entropy loss, CE(y, ŷ), for the skip-gram model described above for a single pair of words w_O and w_I. (Note: your answer should be in terms of P(O = w_O | I = w_I).) [2 pts]
(b) Find the partial derivative of the cross entropy loss calculated in part (a) with respect to the inside word vector v_{w_I}. (Note: your answer should be in terms of y, ŷ and U.) [5 pts]
(c) Find the partial derivative of the cross entropy loss calculated in part (a) with respect to each of the outside word vectors u_{w_O}. (Note: do this for both cases w_O = O (true outside word) and w_O ≠ O (all other words). Your answer should be in terms of y, ŷ and v_{w_I}.) [5 pts]
(d) Explain the idea of negative sampling and the use of the parameter K. Write down the loss function for this case. (Note: your answer should be in terms of u_{w_O}, v_{w_I} and the parameter K.) [3 pts]

4. In the textbook, language modeling was defined as the task of predicting the next word in a sequence given the previous words. In this assignment, you will implement character-level and word-level N-gram language models, as well as a character-level RNN language model. You need to both answer the questions and submit your code.
You need to submit a zip file to Canvas Homework 3 Programming. This file should contain ngram.ipynb, rnn.ipynb, hw3 skeleton char.py and hw3 skeleton word.py. You should also submit a zip file to the Gradescope assignment HW3 Language Models. This file should contain hw3 skeleton char.py and hw3 skeleton word.py.
(a) For N-gram language models, you should complete two scripts, hw3 skeleton char.py and hw3 skeleton word.py. Detailed instructions can be found in ngram.ipynb. You should also use the test cases in ngram.ipynb and use ngram.ipynb to get development results for (c). You need to submit a zip file to Gradescope HW3 Language Models. This file should contain hw3 skeleton char.py and hw3 skeleton word.py. You can see the scores for your code there. Character-level N-gram language models account for 20 points. Word-level N-gram language models account for 10 points, which are a bonus for CS 4650.
[30 pts for CS 7650, 20 pts + bonus 10 pts for CS 4650]
(b) Look at the generation results of your character-level and word-level N-gram language models respectively (n ≥ 1). The paragraphs generated by character-level N-gram language models all start with F. The paragraphs generated by word-level N-gram language models all start with In. Did you get such results? Explain what is going on. (CS 4650 may focus only on the character-level N-gram language model.) [2 pts]
(c) (Bonus for CS 4650) Compare the generation results of the character-level and word-level N-gram language models. Which do you think is better? Compare the perplexity of nytimes article.txt when using character-level and word-level N-gram language models. Explain what you found. [2 pts, bonus for CS 4650]
(d) When you compute perplexity, you can play with different sets of hyper-parameters in both character-level and word-level N-gram language models. You can tune n, k, and λ. Please report here the best results and the corresponding hyper-parameters on the development sets. For character-level N-gram language models, the development set is shakespeare sonnets.txt. For word-level N-gram language models, the development sets are shakespeare sonnets.txt and val e.txt. (CS 4650 should focus only on the character-level N-gram language model.) [6 pts for CS 7650, 2 pts + bonus 4 pts for CS 4650]
(e) For RNN language models, you should complete the forward method of the class RNN in rnn.ipynb. You need to figure out the code and tune the hyperparameters. You should also copy a paragraph generated by your model and report the perplexity on the development set shakespeare sonnets.txt. Compare the results of the character-level RNN language model and the character-level N-gram language model. [10 pts]
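To make the perplexity comparisons in parts (c) and (d) concrete, here is a tiny, self-contained sketch of a character-level bigram model with add-k smoothing (this is not the assignment's skeleton code; the names and the toy training string are ours):

import math
from collections import Counter

def train_char_bigram(text):
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    vocab = sorted(unigrams)
    return unigrams, bigrams, vocab

def perplexity(text, unigrams, bigrams, vocab, k=0.1):
    """Add-k smoothed bigram perplexity of `text` under the trained counts."""
    V = len(vocab)
    log_prob, n = 0.0, 0
    for prev, cur in zip(text, text[1:]):
        p = (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * V)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

train = "the cat sat on the mat. the dog sat on the log."
uni, bi, vocab = train_char_bigram(train)
print(perplexity("the cat sat on the log.", uni, bi, vocab, k=0.1))

Lower k pushes more probability mass onto observed bigrams, which typically lowers perplexity on text similar to the training data and raises it on text with unseen character pairs; that is the trade-off parts (c) and (d) ask you to explore.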
1. A collection of reviews about comedy movies (data D) contains the following keywords and binary labels for whether each movie was funny (+) or not funny (-). The data are shown below: for example, the cell at the intersection of "Review 1" and "laugh" indicates that the text of Review 1 contains 2 tokens of the word "laugh."

Review  laugh  hilarious  awesome  dull  yawn  bland   Y
1         2       1         1       1     1     0     +
2         0       1         2       0     0     0     +
3         3       0         0       0     0     1     +
4         0       1         0       2     1     0     -
5         1       1         1       2     0     2     -
6         1       0         0       2     2     0     -

You may find it easier to complete this problem if you copy the data into a spreadsheet and use formulas for calculations, rather than doing calculations by hand. Please report all scores as log-probabilities, with 3 significant figures. [10 pts]
(a) Assume that you have trained a Naive Bayes model on data D to detect funny vs. not funny movie reviews. Compute the model's predicted scores for funny and not-funny for the following sentence S (i.e. P(+|S) and P(−|S)), and determine which label the model will apply to S. [4 pts]
S: "This film was hilarious! I didn't yawn once. Not a single bland moment. Every minute was a laugh."
(b) The counts in the original data are sparse and may lead to overfitting, e.g. a strong prior on assigning the "not funny" label to reviews that contain "yawn." What would happen if you applied smoothing? Apply add-1 smoothing and recompute the Naive Bayes model's predicted scores for S. Did the label change? [4 pts]
(c) What is an additional feature that you could extract from text to improve the classification of sentences like S, and how would it help improve the classification? [2 pts]

2. [CS 7650 Only] Assume that you are training several logistic regression models. After training on the same data, θ̂ is the optimal weight for an unregularized logistic regression model and θ* is the optimal weight for a logistic regression model with L2 regularization. Prove that ||θ*||_2^2 ≤ ||θ̂||_2^2. Note: you may find it useful to look at the likelihood equations for regularized and unregularized logistic regression. [5 pts]

3. Language modeling is the technique that allows us to compute the probabilities of word sequences. The probability of a sequence W = w_1^n = {w_1, w_2, …, w_n}, with the use of the chain rule, can be estimated as the product of probabilities of each word given the history, as shown:

P(W) = P(w_1, w_2, …, w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) … P(w_n | w_1, …, w_{n−1}) = ∏_{i=1}^{n} P(w_i | w_1^{i−1})

(a) Using an n-gram model allows us to approximate the above probability using only a subset of n − 1 words from the history at each step. Simplify the above expression for the general n-gram case and the bi-gram case. [3 pts]
(b) A common way to have markers for the start and the end of a sentence is to add the [BOS] (beginning of sentence) and [EOS] (end of sentence) tokens at the start and end of every sentence. Consider the following text snippet:

[BOS] i made cheese at home [EOS]
[BOS] i like home made cheese [EOS]
[BOS] cheese made at home is tasty [EOS]
[BOS] i like cheese that is salty [EOS]

Using the expression derived in (a), find the probability of the following sequence as per the bi-gram model: P([BOS] i like cheese made at home [EOS]). [5 pts]
(c) In practice, instead of raw probability, perplexity is used as the metric for evaluating a language model. Define perplexity and find the value of perplexity for the sequence in (b) for the bi-gram case.
[2 pts]
(d) One way to deal with unseen word arrangements in the test set is to use Laplace smoothing, which adds 1 to all bi-gram counts before we normalize them into probabilities. An alternative to Laplace smoothing (add-1 smoothing) is add-k smoothing, where k is a fraction that allows assigning a lesser probability mass to unseen word arrangements. Find the probability of the sequence in (b) with add-k smoothing for k = 0.1. [5 pts]
(e) To deal with unseen words in the test set, a common way is to fix a vocabulary by thresholding on the frequency of words, and assigning an [UNK] token to represent all out-of-vocabulary words. In the example from (a), use a threshold of count > 1 to fix the vocabulary. Find the probability of the following sequence for an add-0.1 smoothed bi-gram model: P([BOS] i like pepperjack cheese [EOS]). [5 pts]

4. In this problem, you will do text classification for hate speech. You need to both answer the questions and submit your code. Hate speech is a (a) deliberate attack, (b) directed towards a specific group of people, (c) motivated by aspects of the group's identity. The three premises must be true for a sentence to be categorized as HATE. Here are two examples:
(a) "Poor white kids being forced to treat apes and parasites as their equals."
(b) "Islam is a false religion however unlike some other false religions it is crude and appeals to crude people such as arabs."
In (a), the speaker uses "apes" and "parasites" to refer to children of dark skin and implies they are not equal to "white kids". That is, it is an attack on the group composed of children of dark skin based on an identifying characteristic, namely, their skin colour. Thus, all the premises are true and (a) is a valid example of HATE. Example (b) brands all people of Arab origin as crude. That is, it attacks the group composed of Arab people based on their origin. Thus, all the premises are true and (b) is a valid example of HATE.
This problem will require programming in Python 3. The goal is to build a Naive Bayes model and a logistic regression model, as you learned in class, on a real-world hate speech classification dataset. Finally, you will explore how to design better features and improve the accuracy of your models for this task. The dataset you will be using is collected from Twitter online. Each example is labeled as 1 (hatespeech) or 0 (non-hatespeech). To get started, you should first download the data and starter code from https://www.cc.gatech.edu/classes/AY2020/cs7650_spring/programming/h2_text_classification.zip. Try to run:

python main.py --model AlwaysPredictZero

This will load the data and run a default classifier, AlwaysPredictZero, which always predicts label 0 (non-hatespeech). You should be able to see the reported train accuracy = 0.4997. That is, always predicting non-hatespeech isn't that good. Let's try to build better classifiers!
Note that you need to implement the models without using any machine learning packages such as sklearn. We will only provide the train set, and we will evaluate your code based on our test set. To get a quick check of your implementations, you can randomly split the dataset we give you into train and test sets at a ratio of 8:2, and compare the accuracy between the models you have implemented and the related models in the sklearn package. You would expect an accuracy of around 0.65 (or above) on your test set.
(a) (Naive Bayes) In this part, you should implement a Naive Bayes model with add-1 smoothing, as taught in class.
You are required to implement the NaiveBayesClassifier class in classifiers.py. You will probably want to take a look at the UnigramFeature class in utils.py, which we have already implemented for you. After you finish your code, run python main.py --model NaiveBayes to check the performance. List the 10 words that, under your model, have the highest ratio P(w|1)/P(w|0) (the most distinctly hatespeech words). List the 10 words with the lowest ratio. What trends do you see? [25 pts]

(b) (Logistic Regression) In this part, you should implement a logistic regression model. You are required to implement the LogisticRegressionClassifier class in classifiers.py. First, implement a logistic regression model without regularization, run python main.py --model LogisticRegression, and compare the performance with your Naive Bayes approach. Next, experiment with L2 regularization: add L2 regularization with different weights such as α ∈ {0.0001, 0.001, 0.01, 0.1, 1, 10} and describe what you observe. (You may want to split the train set we give you into your own train and test sets to observe the performance.) [25 pts]

(c) (Features) In the last part, you'll explore and implement a more sophisticated set of features. You need to implement the class BigramFeature or modify the class CustomFeature in utils.py. Here are some common strategies (you are welcome to implement some of them, but try to come up with more!):
i. Remove stopwords (e.g., a, the, in),
ii. Use a mixture of unigrams, bigrams, or trigrams,
iii. Use TF-IDF features (refer to http://www.tfidf.com/).
Use your creativity for this problem and try to obtain an accuracy as high as possible on your test set! After you implement CustomFeature, run:
python main.py --model NaiveBayes --feature customized
python main.py --model LogisticRegression --feature customized
Describe the features that you have implemented. We'll evaluate your two models on the test set. [Bonus: 10 points] You will receive up to 10 bonus points: up to 5 points based on the novel features you try, and the rest based on how well your models perform compared to other submissions:
Bonus = 5 + 5 × (1/rank)
e.g., if you rank first in the class, you will receive the full bonus points! We will share the winners' code as well.
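For part (b), the following is a rough sketch of binary logistic regression trained by gradient descent with an optional L2 penalty. The toy data, learning rate, and epoch count are placeholder assumptions; the required LogisticRegressionClassifier in classifiers.py will look different.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500, l2=0.0):
    """Gradient descent on the (optionally L2-regularized) negative log-likelihood.
    X: (n_examples, n_features) floats, y: (n_examples,) 0/1 labels, l2: weight alpha."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)                 # predicted P(y = 1 | x)
        grad_w = X.T @ (p - y) / n + l2 * w    # data gradient plus L2 term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Toy usage: two Gaussian clusters, one per label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
for alpha in [0.0, 0.01, 1.0]:
    w, b = train_logreg(X, y, l2=alpha)
    acc = np.mean(predict(X, w, b) == y)
    print(f"alpha={alpha}: ||w||^2={np.dot(w, w):.3f}, train acc={acc:.2f}")

Note how the reported squared weight norm shrinks as α grows, which is the qualitative behavior you are asked to observe here (and, for CS 7650 students, to prove in general in problem 2 above).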
2. The entropy of a discrete random variable X is defined as (use base e for all log operations unless specified otherwise):
H(X) = − Σ_{x∈X} P(x) log P(x)
(a) Compute the entropy of the distribution P(x) = Multinoulli([0.2, 0.3, 0.5]). [3 pts]
(b) Compute the entropy of the uniform distribution P(x) = 1/m, ∀ x ∈ [1, m]. [3 pts]
(c) Consider the entropy of the joint distribution P(X, Y):
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} P(x, y) log P(x, y)
How does this entropy relate to H(X) and H(Y) (i.e., the entropies of the marginal distributions) when X and Y are independent? [4 pts]
Solution. Solution goes here.

3. You are investigating articles from the New York Times and from Buzzfeed. Some of the articles contain fake news, while others contain real news (assume that there are only two types of news). Note: for the following questions, write your answer using up to 3 significant figures.
(a) Fake news only accounts for 5% of all articles in all newspapers. However, it is known that 30% of all fake news comes from Buzzfeed. In addition, Buzzfeed generates 25% of all news articles. What is the probability that a randomly chosen Buzzfeed article is fake news? [3 pts]
(b) Suppose that 15% of all fake news comes from the New York Times (NYT). Furthermore, suppose that 60% of all real news comes from the NYT. Under all assumptions so far, what is the probability that a randomly chosen NYT article is fake news? [3 pts]
(c) Mike is an active reader of the New York Times: Mike reads 80% of all NYT articles. However, he also has a suspicion that the NYT is a bad publisher, and he believes that 25% of all NYT articles are fake news. Furthermore, the NYT generates 30% of all news articles. Under all assumptions so far, what is the probability that a randomly chosen article (from all newspapers) will be from the NYT, will be read by Mike, and will be believed to be fake news? [4 pts]
Solution. Solution goes here.

4. Suppose we have a probability density function (pdf) defined as:
f(x, y) = C(x^2 + 2y) for 0 < x < 1 and 0 < y < 1, and 0 otherwise.
(a) Find the value of C. [2 pts]
(b) Find the marginal distributions of X and Y. [4 pts]
(c) Find the joint cumulative distribution function (cdf) of X and Y. [4 pts]
Solution. Solution goes here.

5. [Graduate Students Only] A 2-D Gaussian distribution is defined as:
G(x, y) = (1 / (2πσ^2)) exp(−(x^2 + y^2) / (2σ^2))
Compute the following integral:
∫_{−∞}^{∞} ∫_{−∞}^{∞} G(x, y) (5x^2 y^2 + 3xy + 1) dx dy
Hint: Think in terms of the properties of probability distribution functions. [5 pts]
Solution. Solution goes here.
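The entropy definitions in problem 2 above translate directly into a few lines of Python. This small check is a sketch only, using the natural log to match the base-e convention stated there.

import math

def entropy(probs):
    """H(X) = -sum_x P(x) ln P(x) for a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Part (a): Multinoulli([0.2, 0.3, 0.5]).
print(entropy([0.2, 0.3, 0.5]))

# Part (b): uniform distribution over m outcomes; compare against ln(m).
m = 6
print(entropy([1 / m] * m), math.log(m))

# Part (c): for independent X and Y, compare H(X, Y) against H(X) and H(Y).
px, py = [0.2, 0.8], [0.5, 0.3, 0.2]
pxy = [a * b for a in px for b in py]
print(entropy(pxy), entropy(px), entropy(py))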
Consider a simple 2-state MDP shown in Figure 1, S = {S1, S2}. From each state, there are two available actions A = {stay, go}. The reward received for taking an action at a state is shown in the figure for each (s, a) pair. Also, taking action go from state S2 ends the episode. All transitions are deterministic.
Figure 1: 2-state MDP.
(a) Consider a policy that always chooses the stay action at every state. Compute the sum of discounted rewards obtained by this policy starting at state S1, assuming a discount of γ, i.e., compute Σ_{t=0}^{∞} γ^t r_t(s_t, a_t). [4 points]
(b) What is the optimal policy? Write it as a tuple (a_1, a_2) of the optimal actions at (s_1, s_2) respectively. Compute the sum of discounted rewards obtained by this policy, given that the start state is S1, with discount γ. [4 points]
(c) Recall that one update of the value iteration algorithm performs the following operation, where r(s, a) denotes the reward for taking action a at state s. For each s ∈ S:
V^{i+1}(s) ← max_a Σ_{s'} p(s' | s, a) [ r(s, a) + γ V^i(s') ]   (1)
The value function V is stored as an array of size |S| (|S| = 2 in our case). For example, the array V^0 = [0, 1] denotes that V^0(s_1) = 0, V^0(s_2) = 1. Starting with an initial V^0 = [0, 0], perform the value iteration update to compute V^1, V^2, V^3. What is the optimal V*? [7 points]
In part (c) of the previous question, we computed V^i for a few updates of value iteration and also V*, the optimal value function. What happened to the "error" in the value estimate at every update, i.e., did |V^i(s) − V*(s)| ever increase for a particular state s? The answer is yes: the value estimate for any s may fluctuate before eventually converging to V*(s). If this is the case, do we have any hope that the value iteration algorithm will converge to V*? In this question, we will prove that value iteration indeed converges, due to the monotonic decrease of a different error: the maximum error over all states, max_{s∈S} |V^i(s) − V*(s)|, or the L∞ norm ||V^i − V*||_∞.
(a) For V^1, V^2, V^3 obtained in 1(c), compute ||V^i − V*||_∞ and verify that this error decreased monotonically. [5 points]
(b) Let T : R^{|S|} → R^{|S|} be the function that takes the vector V^i as input and applies the value iteration update to produce V^{i+1}. We will show that for any two vectors V, V' ∈ R^{|S|}, this operator decreases the L∞ norm between them. Prove that:
||T(V) − T(V')||_∞ ≤ γ ||V − V'||_∞   for any V, V' ∈ R^{|S|}.   [10 points]
(c) Now consider the sequence of value functions {V^n} obtained by iteratively performing the value iteration update as per Eq. 1. This sequence has the interesting property of being a Cauchy sequence, i.e., the elements of the sequence become arbitrarily close to each other as the sequence progresses. Formally, we say a sequence is Cauchy w.r.t. the L∞ norm if for every positive number ε there is a positive integer N such that for all positive integers m, n greater than N, ||x_m − x_n||_∞ < ε. Equipped with this fact, show that the error of V^{n+1} w.r.t. the optimal value function can be bounded as:
∀ε > 0, ∃N > 0 s.t. ∀n > N, ||V^{n+1} − V*||_∞ ≤ (γ / (1 − γ)) ε   (2)
[10 points]
Hint: First try to prove that ||V^{n+1} − V*||_∞ ≤ (γ / (1 − γ)) ||V^n − V^{n+1}||_∞, then apply the Cauchy sequence condition on {V^n}.
Discussion: Given Equation 2, we now have a guarantee that value iteration converges to V*. However, we made the assumption that {V^n} is a Cauchy sequence. As a bonus question below, we will prove that the condition in (a) is sufficient to conclude that {V^n} is indeed a Cauchy sequence.
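A small value-iteration loop makes the update in Eq. 1 and the L∞ error check in part (a) easy to experiment with. The transition and reward numbers below are hypothetical placeholders, not the values from Figure 1 (which is not reproduced in the text), so treat this purely as a template.

import numpy as np

# Hypothetical 2-state, 2-action MDP in the spirit of Figure 1 (placeholder numbers,
# NOT the values from the actual figure). Action 0 = "stay", action 1 = "go".
# P[a, s, s'] is the transition probability and R[s, a] the reward.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # stay: remain in the same state
    [[0.0, 1.0], [0.0, 0.0]],   # go: S1 -> S2; go from S2 ends the episode (all-zero row)
])
R = np.array([[1.0, 0.0],       # rewards for (S1, stay), (S1, go): placeholders
              [2.0, 10.0]])     # rewards for (S2, stay), (S2, go): placeholders
gamma = 0.9

def value_iteration(P, R, gamma, iters=100):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    history = [V.copy()]
    for _ in range(iters):
        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s');
        # an all-zero transition row models episode termination.
        Q = np.array([[R[s, a] + gamma * P[a, s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        V = Q.max(axis=1)
        history.append(V.copy())
    return V, history

V_star, hist = value_iteration(P, R, gamma)
for i in range(1, 4):
    err = np.max(np.abs(hist[i] - V_star))   # ||V^i - V*||_inf, as in part (a)
    print(f"V^{i} = {hist[i]}, L_inf error = {err:.4f}")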
(d) [BONUS] Given a function T : R^n → R^n such that ||T(x_1) − T(x_2)||_∞ ≤ α ||x_1 − x_2||_∞ with α ∈ [0, 1), prove first that T has a fixed point, i.e., ∃ x* : T(x*) = x*, and second that the fixed point of T is unique. [5 bonus points]
This is a challenging question with all sub-parts as bonus credit. While value iteration (VI) allows us to find the optimal policy, performing VI updates (Eq. 1) requires the transition function T(s, a) := p(s' | s, a)¹ and the reward function R(s, a). In many practical situations, these quantities are unknown. However, we can step through the environment and observe transitions to collect data of the form (s, a, s', r). Using this data, if we can obtain an estimate of the transition and reward functions, we can hope to still find the optimal policy via value iteration. Let M denote the true (unknown to us) model of the world (a model consists of both a transition and a reward function), and let M̂ represent our approximate model of the world that we estimate from data. In this question, given estimates of both the transition function T̂(s, a) = p̂(s' | s, a) ∈ Δ^{|S|−1} and the reward function R̂(s, a), we want to study the error of acting optimally in this estimated MDP as opposed to acting optimally in the "true" MDP. Naturally, we should expect this error to be smaller as we get better estimates, which in turn happens as we make more observations.
¹ We will denote by T(s, a, s') the probability value of s' given (s, a), whereas T(s, a) denotes the entire probability distribution p(s' | s, a).
(a) [BONUS] For the first part of this problem, assume that we have used our favorite learning method to obtain estimates of the transition function T̂ and the reward function R̂. Given that max_{s,a} |R̂(s, a) − R(s, a)| ≤ ε_R and max_{s,a} ||T̂(s, a) − T(s, a)||_1 ≤ ε_P, show that for any policy π : S → A and all s ∈ S, [10 bonus points]
||V^π_{M̂} − V^π_M||_∞ ≤ ε_R / (1 − γ) + γ ε_P / (1 − γ)^2   (3)
Assume that the reward function is in [0, 1]. Hint: 1) Expand V_M(s) using the Bellman equation. 2) Hölder's inequality: ||u · v||_1 ≤ ||u||_1 · ||v||_∞.
(b) [BONUS] We just bounded the error of acting according to any policy π : S → A under the estimated model M̂ vs. the true model M. Now, we use this to bound the value error of acting optimally vs. acting optimally according to the estimated model (π*_{M̂}). Using (3), show that [3 bonus points]
V*_M(s) − V^{π*_{M̂}}_M(s) ≤ 2 sup_{π : S → A} ||V^π_{M̂} − V^π_M||_∞   (4)
(c) [BONUS] Given
ε_R = √( (1 / 2n) log( 4|S × A| / δ ) )   (5)
ε_P = |S| · √( (1 / 2n) log( 4|S × A × S| / δ ) )   (6)
use (3) to simplify (4), i.e., V*_M(s) − V^{π*_{M̂}}_M(s). What is the dependence of this error on n, the number of observations required for each state-action pair? [2 bonus points]
(d) [BONUS] Given a dataset D_{s,a} = {(s, a, s'_1, r_1), ..., (s, a, s'_n, r_n)}, we estimate the unknown transition and reward functions using empirical frequencies as:
T̂(s, a, s') = (1/n) Σ_{i=1}^{n} 1(s'_i = s')   (7)
R̂(s, a) = (1/n) Σ_{i=1}^{n} r_i   (8)
For notational convenience, we will use T̂(s, a) to denote a vector of size |S| whose elements are the probabilities computed in (7) for each next state s' ∈ S. Prove that ε_R and ε_P take the values given in (5) and (6).
[5 bonus points] Hint: Hoeffding's inequality.²
² https://en.wikipedia.org/wiki/Hoeffding%27s_inequality#General_case_of_bounded_random_variables

4 Policy Gradients Variance Reduction [10 points]
In class (this will be covered on Oct 29th, but you don't really need to know the derivation for this question), we derived the policy gradient as
∇_θ J(θ) = ∇_θ E_{τ∼π_θ}[R(τ)]   (9)
         = E_{τ∼π_θ}[R(τ) ∇_θ log π_θ(τ)]   (10)
         ≈ (1/N) Σ_{i=1}^{N} R(τ_i) ∇_θ log π_θ(τ_i)   (11)
where τ is a trajectory³, R(τ) the associated reward, N the number of trials used to estimate the expected reward J(·), and θ the parameters of the policy. These gradients are often very noisy and lead to unstable training. In this question, we show a simple method to reduce the variance of the gradients.
(a) Show that adjusting the reward as R(τ) := R(τ) − b does not change the gradient estimate in Eq. 9. [2 points]
(b) Compute the variance of ∇_θ J(θ) and notice how subtracting b helps reduce it. What value of b will lead to the least variance? [8 points]
Hint: The variance is written as Var(x) = E[x^2] − (E[x])^2, where x is substituted as R(τ) ∇_θ log π_θ(τ).

5 Coding: Dynamic Programming and Deep Q-Learning [50 points + 15 bonus points]
Follow the instructions at this link to the HW4 coding webpage.
³ A trajectory is defined as a sequence of states and actions, i.e., (s_0, a_0, s_1, a_1, ..., s_{n−1}, a_{n−1}, s_n).
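To see the effect of the baseline b in parts (a) and (b) empirically, the toy Monte Carlo sketch below compares the spread of the score-function estimator R(τ)∇_θ log π_θ(τ) with and without a constant baseline, using a one-parameter Gaussian "policy" as a stand-in for π_θ. The policy, reward function, and baseline choice are all illustrative assumptions, not part of the assignment.

import numpy as np

rng = np.random.default_rng(0)
theta = 1.0          # mean of a 1-D Gaussian "policy" over actions
sigma = 1.0
N = 100_000

# Sample actions a ~ N(theta, sigma^2); the reward is a simple deterministic function of a.
a = rng.normal(theta, sigma, N)
R = -(a - 3.0) ** 2                         # toy reward: prefer actions near 3
score = (a - theta) / sigma ** 2            # d/dtheta of log N(a; theta, sigma^2)

g_no_baseline = R * score                   # per-sample gradient estimates, b = 0
b = R.mean()                                # a simple constant baseline (average reward)
g_baseline = (R - b) * score                # per-sample estimates with the baseline

print("mean (no baseline):", g_no_baseline.mean())
print("mean (baseline):   ", g_baseline.mean())    # means agree up to sampling noise (part (a))
print("var  (no baseline):", g_no_baseline.var())
print("var  (baseline):   ", g_baseline.var())     # variance drops with the baseline (part (b))

The two estimators target the same gradient, but the per-sample variance shrinks once the baseline is subtracted, which is the effect the question asks you to quantify.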
The convolution layer inside of a CNN is intrinsically an affine transformation: a vector is received as input and is multiplied by a matrix to produce an output (to which a bias vector is usually added before passing the result through a nonlinearity). This operation can be represented as y = Ax, in which A describes the affine transformation.
1. [2 points] We will first revisit the convolution layer as discussed in class. Consider a convolution layer with a 2×2 kernel W, operated on a single input channel X, represented as:
W = [ w(0,0)  w(0,1)
      w(1,0)  w(1,1) ],   X = [ x(0,0)  x(0,1)  x(0,2)
                                x(1,0)  x(1,1)  x(1,2)
                                x(2,0)  x(2,1)  x(2,2) ]   (1)
Now let us work out a stride-3 convolution layer with a zero-padding size of 1. Consider 'flattening' the input tensor X in row-major order as:
X = [ x(0,0), x(0,1), x(0,2), ..., x(2,0), x(2,1), x(2,2) ]^T   (2)
Write down the convolution as a matrix operation A such that Y = AX. The output Y is also flattened in row-major order.
2. [2 points] Recall that transposed convolution can help us upsample the feature size spatially. Consider a transposed convolution layer with a 2×2 kernel W operated on a single input channel X, represented as:
W = [ w(0,0)  w(0,1)
      w(1,0)  w(1,1) ],   X = [ x(0,0)  x(0,1)
                                x(1,0)  x(1,1) ]   (3)
We 'flatten' the input tensor in row-major order as X = [x(0,0), x(0,1), x(1,0), x(1,1)]^T. Write down the affine transformation A corresponding to a transposed convolution layer with kernel W, stride 2, and no padding. The output Y is also flattened in row-major order.
3. [4 points] Convolution layers in most CNNs consist of multiple input and output feature maps. The collection of kernels forms a 4D tensor (output channels o, input channels i, filter rows k, filter columns k), represented in short as (o, i, k, k). For each output channel, each input channel is convolved with a distinct 2D slice of the 4D kernel, and the resulting set of feature maps is summed element-wise to produce the corresponding output feature map. There is an interesting property that a convolutional layer with kernel size (o × r^2, i, k, k) is identical to a transposed convolution layer with kernel size (o, i, k × r, k × r). Here the word 'identical' means that with the same input feature X, both operations will give the same output Y, with only a difference in the ordering of the flattened elements of Y. Now let us prove the property in a restricted setting. Consider o = 1, r = 2, i = 1, k = 1. Given the same input feature X as in equation 3, write down the affine transformation for a convolutional layer with kernel size (4, 1, 1, 1), and show it is an operation identical to a transposed convolution layer with kernel size (1, 1, 2, 2).

1. [4 points] Implement AND and OR for pairs of binary inputs using a single linear threshold neuron with weights w ∈ R^2, bias b ∈ R, and x ∈ {0, 1}^2:
f(x) = 1 if w^T x + b ≥ 0, and 0 if w^T x + b < 0   (4)
That is, find w_AND and b_AND such that
x1  x2  f_AND(x)
0   0   0
0   1   0
1   0   0
1   1   1
Also find w_OR and b_OR such that
x1  x2  f_OR(x)
0   0   0
0   1   1
1   0   1
1   1   1
2. [4 points] Consider the XOR function
x1  x2  f_XOR(x)
0   0   0
0   1   1
1   0   1
1   1   0
Show that XOR can NOT be represented using a linear model with the same form as (4). [Hint: To see why, plot the examples from above in a plane and think about drawing a linear boundary that separates them.]

Consider a specific 2-hidden-layer ReLU network with inputs x ∈ R, 1-dimensional outputs, and 2 neurons per hidden layer.
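Before the ReLU network just introduced is defined in detail, here is a quick NumPy sanity check for the convolution-as-matrix questions (1–3) above: it builds the matrix A for a single-channel, zero-padded, strided convolution and compares A x against a direct sliding-window computation. The kernel values, input values, stride, and padding are placeholders you can change; this is a checking tool, not the requested written answer.

import numpy as np

def conv2d_as_matrix(kernel, in_h, in_w, stride, pad):
    """Return (A, out_h, out_w) so that A @ x_flat equals the flattened convolution output.
    x_flat is the row-major flattening of the (in_h, in_w) input; zero padding is folded
    into A by simply skipping columns that would fall on padded pixels."""
    k_h, k_w = kernel.shape
    out_h = (in_h + 2 * pad - k_h) // stride + 1
    out_w = (in_w + 2 * pad - k_w) // stride + 1
    A = np.zeros((out_h * out_w, in_h * in_w))
    for oy in range(out_h):
        for ox in range(out_w):
            for ky in range(k_h):
                for kx in range(k_w):
                    iy = oy * stride + ky - pad
                    ix = ox * stride + kx - pad
                    if 0 <= iy < in_h and 0 <= ix < in_w:     # inside the unpadded input
                        A[oy * out_w + ox, iy * in_w + ix] = kernel[ky, kx]
    return A, out_h, out_w

def conv2d_direct(x, kernel, stride, pad):
    """Sliding-window cross-correlation (the usual deep-learning 'convolution')."""
    xp = np.pad(x, pad)
    k_h, k_w = kernel.shape
    out_h = (xp.shape[0] - k_h) // stride + 1
    out_w = (xp.shape[1] - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for oy in range(out_h):
        for ox in range(out_w):
            out[oy, ox] = np.sum(xp[oy*stride:oy*stride+k_h, ox*stride:ox*stride+k_w] * kernel)
    return out

W = np.array([[1.0, 2.0], [3.0, 4.0]])        # placeholder kernel values
X = np.arange(9, dtype=float).reshape(3, 3)   # placeholder 3x3 input
A, oh, ow = conv2d_as_matrix(W, 3, 3, stride=3, pad=1)
print(A @ X.ravel())                                   # matrix view of the convolution
print(conv2d_direct(X, W, stride=3, pad=1).ravel())    # should match the line above

The written answer keeps the kernel entries w(·,·) symbolic, but the index bookkeeping is the same. We now return to the two-hidden-layer ReLU network introduced above.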
This function is given by
h(x) = W^(3) max{0, W^(2) max{0, W^(1) x + b^(1)} + b^(2)} + b^(3)   (5)
with weights:
W^(1) = [0.5, 0.5]^T   (6)
b^(1) = [0, 1]^T   (7)
W^(2) = [[1, 1], [1, 1]]   (8)
b^(2) = [0, 0]^T   (9)
W^(3) = [1, 1]   (10)
b^(3) = 1   (11)
An interesting property of networks with piecewise linear activations like the ReLU is that, taken as a whole, they compute piecewise linear functions. For each of the following points, give the weight W ∈ R and bias b ∈ R (report the numerical values) of the linear piece that the network computes there, i.e., such that Wx + b = h(x) in a neighborhood of the given point. Also compute the gradient dh/dx evaluated at the given point.
1. [2 points] x = 1   (12)
2. [2 points] x = −1   (13)
3. [2 points] x = −0.5   (14)
Now we'll turn to a more recent result that highlights the Deep in Deep Learning. Depth (composing more functions) results in a favorable combinatorial explosion in the "number of things that a neural net can represent." For example, to classify a cat it seems useful to first find parts of a cat: eyes, ears, tail, fur, etc. The function which computes a probability of cat presence should be a function of these components, because this allows everything you learn about eyes to generalize to all instances of eyes instead of just a single instance. Below you will detail one formalizable sense of this combinatorial explosion for a particular class of piecewise linear networks.
Figure 1: the absolute value activation y = |x|.
Consider y = σ(x) = |x| for scalar x ∈ X ⊆ R and y ∈ Y ⊆ R (Fig. 1). It has one linear region on x < 0 and another on x > 0, and the activation identifies these regions, mapping both of them to y > 0. More precisely, for each linear region of the input, σ(·) is a bijection: there is a mapping to and from the output space and the corresponding input space. However, given an output y, it's impossible to tell which linear region of the input it came from; thus σ(·) identifies (maps on top of each other) the two linear regions of its input. This is the crucial definition, because when a function identifies multiple regions of its domain, any subsequent computation applies to all of those regions. When these regions come from an input space like the space of images, functions which identify many regions where different images might fall (e.g., slightly different images of a cat) automatically transfer what they learn about a particular cat to cats in the other regions. More formally, we will say that σ(·) identifies a set of M disjoint input regions R = {R_1, ..., R_M} (e.g., R = {(−1, 0), (0, 1)}) with R_i ⊆ X onto one output region O ⊆ Y (e.g., (0, 1)) if for all R_i ∈ R there is a bijection from R_i to O.¹
1. [4 points] Start by applying the above notion of identified regions to the linear regions of one layer of a particular neural net that uses absolute value functions as activations. Let x ∈ R^d, y ∈ R^d,² and pick weights W^(1) ∈ R^{d×d} and bias b^(1) ∈ R^d as follows:
W^(1)_{ij} = 2 if i = j, and 0 if i ≠ j   (15)
b^(1)_i = −1   (16)
Then one layer of a neural net with absolute value activation functions is given by
f_1(x) = |W^(1) x + b^(1)|   (17)
Note that this is an absolute value function applied elementwise and not a norm. How many regions of the input are identified onto O = (0, 1)^d by f_1(·)? Prove it.³
¹ Recall that a bijection from X to Y is a function µ : X → Y such that for all y ∈ Y there exists a unique x ∈ X with µ(x) = y.
² Outputs are in some feature space, not a label space. Normally a linear classifier would be placed on top of what we are here calling y.
³ Absolute value activations are chosen to make the problem simpler, but a similar result holds for ReLU units.
Also, O could be the positive orthant (unbounded above).
2. [4 points] Next consider what happens when two of these functions are composed. Suppose g identifies n_g regions of (0, 1)^d onto (0, 1)^d and f identifies n_f regions of (0, 1)^d onto (0, 1)^d. How many regions of its input does f ∘ g(·) identify onto (0, 1)^d?
3. [6 points] Finally, consider a series of L layers identical to the one in question 3.1:
h_1 = |W_1 x + b_1|   (18)
h_2 = |W_2 h_1 + b_2|   (19)
...   (20)
h_L = |W_L h_{L−1} + b_L|   (21)
Let x ∈ (0, 1)^d and f(x) = h_L. Note that each h_i is implicitly a function of x. Show that f(x) identifies 2^{Ld} regions of its input.
Now compare the number of identified regions for an L-layer net to that of an (L−1)-layer net. The L-layer net can separate its input space into 2^d times as many linear regions as the (L−1)-layer net. On the other hand, the number of parameters and the amount of computation time grow only linearly in the number of layers. In this very particular sense (which doesn't always align well with practice), deeper is better.
To summarize this problem set, you've shown a number of results about the representation power of different neural net architectures. First, neural nets (even single neurons) can represent logical operations. Second, the neural nets we use today compute piecewise linear functions of their input. Third, the representation power of neural nets increases exponentially with the number of layers. The point of the exercise was to convey intuition that removes some of the magic from neural net representations. Specifically, neural nets can decompose problems logically, and piecewise linear functions can be surprisingly powerful.

6 Coding: Uses of Gradients With Respect to Input [50 points for CS7643, 40 points for CS4803]
The coding part of this assignment will have you implement 3 (CS4803) or 4 (CS7643) different ideas which all rely on gradients of some neural net output with respect to the input image. To get started, go to https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/CQQNDw8CAwYGAAQLBAUECw/hw2-q6/.
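Since the coding portion centers on gradients of a network output with respect to its input, here is a tiny autograd sketch that evaluates the small ReLU network h(x) from Eqs. (5)-(11) and reads off the local linear map (W, b) and dh/dx at a point. The numerical weights are copied from the equations as printed above, so double-check them against the original handout before trusting the printed numbers; PyTorch is assumed to be available.

import torch

# Weights of the 2-hidden-layer ReLU net as printed in Eqs. (6)-(11); verify against the handout.
W1 = torch.tensor([[0.5], [0.5]])
b1 = torch.tensor([0.0, 1.0])
W2 = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
b2 = torch.tensor([0.0, 0.0])
W3 = torch.tensor([[1.0, 1.0]])
b3 = torch.tensor([1.0])

def h(x):
    """Evaluate h(x) = W3 max(0, W2 max(0, W1 x + b1) + b2) + b3 for scalar x."""
    z1 = torch.relu(W1 @ x.reshape(1) + b1)
    z2 = torch.relu(W2 @ z1 + b2)
    return (W3 @ z2 + b3).squeeze()

for x0 in [1.0, -1.0, -0.5]:
    x = torch.tensor(x0, requires_grad=True)
    y = h(x)
    y.backward()                       # dh/dx at x0 via autograd
    # Locally h is linear: h(x) = W x + b with W = dh/dx and b = h(x0) - W * x0.
    W_local = x.grad.item()
    b_local = y.item() - W_local * x0
    print(f"x = {x0:5.2f}: h(x) = {y.item():.3f}, dh/dx = {W_local:.3f}, local b = {b_local:.3f}")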
1. [Vanilla RNN for parity function: 4 points] Let us define a sequence parity function as a function that takes in a sequence of binary inputs and returns a sequence indicating the number of 1's in the input so far; specifically, if the number of 1's seen up to time t is odd it returns 1, and 0 if it is even. For example, given the input sequence [0, 1, 0, 1, 1, 0], the parity sequence is [0, 1, 1, 0, 1, 1]. Let x_t denote the input at time t and y_t the boolean output of this parity function at time t. Design a vanilla recurrent neural network to implement the parity function. Your implementation should include the equations of your hidden units and the details of the activations for the different inputs x_t (0 or 1). Hint: Recall that we implemented AND and OR functions in problem set 2, which may be useful here.

2. [LSTM for parity function: 4 points] Let us recall the different gates in an LSTM network. The first gate is the "forget gate layer":
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)   (1)
where f_t is the output of the forget gate, W_f is the weight matrix, h_{t−1} is the hidden state of step t−1, x_t is the current input, and b_f is the bias value. Next we have the "input gate layer":
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)   (2)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)   (3)
where i_t decides which values we will update and C̃_t are the new candidate values that could be added to the cell state. Next we have the new cell state:
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t   (4)
Finally, we have the output and hidden state:
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)   (5)
h_t = o_t ∗ tanh(C_t)   (6)
Design an LSTM network for the bit parity problem mentioned in Question 1. Specifically, provide values for W_f, b_f, W_i, b_i, W_C, b_C, W_o, and b_o such that the cell state C_t stores the parity of the bit string. Please mention any assumptions you make. For this problem, you can assume the following for the sigmoid and tanh functions:
σ(x) = 1 if x > 0, and 0 otherwise   (7)
tanh(x) = 1 if x > 0, 0 if x = 0, and −1 if x < 0   (8)
Hint: Recall that the XOR of x and y can be represented as (x ∧ ȳ) ∨ (x̄ ∧ y). Think about how you can leverage this information for equation (4).

3. [When to stop in beam search?: 4 points] Beam search is a technique widely used to improve the output text quality of neural text generation. But it is difficult to decide when to stop beam search to obtain optimality, because hypotheses can finish in different steps. Assume an input x from which we generate an output sentence y which is a completed hypothesis:
y* = argmax_{y : comp(y)} ∏_{i ≤ |y|} p(y_i | x, y_{<i})

h_t = W h_{t−1}   (12)
where W is a weight-sharing matrix for the recurrent relation at any time t. Let λ_1, ..., λ_n be the eigenvalues of the weight matrix W ∈ C^{n×n}. Its spectral radius ρ(W) is defined as:
ρ(W) = max{|λ_1|, ..., |λ_n|}   (13)
Assuming the initial hidden state is h_0, write the relation between h_T and h_0 and explain the role of the eigenvalues of W in determining the 'vanishing' or 'exploding' property as T grows.

2 Coding: Sequence models for image captioning [55 regular points for CS7643, 45 regular points for CS4803]
The coding part of this assignment will consist of implementing sequence models for captioning images. To get started, go to https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/8cc2zXixdvAocq4v4upA9A/hw3/
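Taking the linear recurrence in Eq. (12) at face value, h_T is just W applied T times to h_0, so whether the hidden state vanishes or explodes is governed by the spectral radius in Eq. (13). The matrices below are arbitrary illustrative choices, not part of the assignment.

import numpy as np

def spectral_radius(W):
    """rho(W) = max |lambda_i| over the eigenvalues of W."""
    return np.max(np.abs(np.linalg.eigvals(W)))

rng = np.random.default_rng(0)
h0 = rng.normal(size=3)

for scale in [0.5, 1.5]:                     # contractive vs. expansive recurrent matrix
    W = scale * np.eye(3) + 0.01 * rng.normal(size=(3, 3))
    h = h0.copy()
    for _ in range(50):                      # apply h_t = W h_{t-1} fifty times
        h = W @ h
    print(f"rho(W) = {spectral_radius(W):.3f}, ||h_50|| = {np.linalg.norm(h):.3e}")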
1. (3 points) We often use iterative optimization algorithms such as gradient descent to find w that minimizes a loss function f(w). Recall that in gradient descent, we start with an initial value of w (say w^(1)) and iteratively take a step in the direction of the negative gradient of the objective function, i.e.,
w^(t+1) = w^(t) − η ∇f(w^(t))   (1)
for learning rate η > 0.
In this question, we will develop a slightly deeper understanding of this update rule, in particular for minimizing a convex function f(w). Note: this analysis will not directly carry over to training neural networks, since loss functions for training neural networks are typically not convex, but it will (a) develop intuition and (b) provide a starting point for research in non-convex optimization (which is beyond the scope of this class).
Recall the first-order Taylor approximation of f at w^(t):
f(w) ≈ f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩   (2)
When f is convex, this approximation forms a lower bound of f, i.e.,
f(w) ≥ f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩   for all w,   (3)
where the right-hand side is an affine lower bound to f(·).
Since this approximation is a 'simpler' function than f(·), we could consider minimizing the approximation instead of f(·). Two immediate problems: (1) the approximation is affine (thus unbounded from below) and (2) the approximation is faithful only for w close to w^(t). To solve both problems, we add a squared ℓ2 proximity term to the approximation minimization:
argmin_w  f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩ + (λ/2) ||w − w^(t)||^2   (4)
where the first two terms form the affine lower bound to f(·), the last term is the proximity term, and λ controls the trade-off.
Notice that the optimization problem above is an unconstrained quadratic program, meaning that it can be solved in closed form (hint: gradients). What is the solution w* of the above optimization? What does that tell you about the gradient descent update rule? What is the relationship between λ and η?

2. (3 points) Let's prove a lemma that will initially seem unrelated to the rest of the analysis, but will come in handy in the next sub-question when we start combining things. Specifically, the analysis in this sub-question holds for any w*, but in the next sub-question we will use it for the w* that minimizes f(w). Consider a sequence of vectors v_1, v_2, ..., v_T, and an update equation of the form w^(t+1) = w^(t) − η v_t with w^(1) = 0. Show that:
Σ_{t=1}^{T} ⟨w^(t) − w*, v_t⟩ ≤ ||w*||^2 / (2η) + (η/2) Σ_{t=1}^{T} ||v_t||^2   (5)

3. (3 points) Now let's start putting things together and analyze the convergence rate of gradient descent, i.e., how fast it converges to w*. First, show that for w̄ = (1/T) Σ_{t=1}^{T} w^(t),
f(w̄) − f(w*) ≤ (1/T) Σ_{t=1}^{T} ⟨w^(t) − w*, ∇f(w^(t))⟩   (6)
Next, use the result from part 2, with upper bounds B and ρ for ||w*|| and ||∇f(w^(t))|| respectively, and show that for fixed η = √(B^2 / (ρ^2 T)), the convergence rate of gradient descent is O(1/√T), i.e., the upper bound for f(w̄) − f(w*) is proportional to 1/√T. (A short numerical illustration of this appears after part 4 below.)

4. (2 points) Consider an objective function comprised of N = 2 terms:
f(w) = (1/2)(w − 2)^2 + (1/2)(w + 1)^2   (7)
Now consider using SGD (with a batch size B = 1) to minimize this objective. Specifically, in each iteration, we will pick one of the two terms (uniformly at random) and take a step in the direction of the negative gradient of that term, with a constant step size of η. You can assume η is small enough that every update does result in improvement (aka descent) on the sampled term. Is SGD guaranteed to decrease the overall loss function in every iteration? If yes, provide a proof. If no, provide a counter-example.
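As a small numerical companion to parts 1-3, the snippet below runs the plain gradient descent update of Eq. (1) on the two-term objective of Eq. (7) and reports the suboptimality of the averaged iterate w̄ for a step size that scales like 1/√T. The particular step-size constant and iteration counts are illustrative assumptions, not the constants from the analysis.

import numpy as np

def f(w):
    return 0.5 * (w - 2.0) ** 2 + 0.5 * (w + 1.0) ** 2   # the two-term objective in Eq. (7)

def grad_f(w):
    return (w - 2.0) + (w + 1.0)

w_star = 0.5                      # minimizer of f (set grad_f(w) = 0)
for T in [10, 100, 1000, 10000]:
    eta = 1.0 / np.sqrt(T)        # a 1/sqrt(T) step size, in the spirit of part 3
    w, iterates = 0.0, []
    for _ in range(T):
        iterates.append(w)
        w = w - eta * grad_f(w)   # full-batch gradient descent, Eq. (1)
    w_bar = np.mean(iterates)     # averaged iterate from part 3
    print(f"T={T:6d}: f(w_bar) - f(w*) = {f(w_bar) - f(w_star):.6f}")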
5. (4 points) In practice, writing the closed-form expression of the derivative of a loss function f w.r.t. the parameters of a deep neural network is hard (and mostly unnecessary) as f becomes complex. Instead, we define computation graphs and use automatic differentiation algorithms (typically backpropagation) to compute gradients using the chain rule. For example, consider the expression
f(x, y) = (x + y)(y + 1)   (8)
Let's define intermediate variables a and b such that
a = x + y   (9)
b = y + 1   (10)
f = a × b   (11)
A computation graph for the "forward pass" through f is shown in Fig. 1.
Figure 1: computation graph for f(x, y) = (x + y)(y + 1).
We can then work backwards and compute the derivative of f w.r.t. each intermediate variable (∂f/∂a, ∂f/∂b) and chain them together to get ∂f/∂x and ∂f/∂y (a short numerical check of this example appears at the end of this section). Let σ(·) denote the standard sigmoid function. Now, for the following vector function:
f_1(w_1, w_2) = e^{e^{w_1} + e^{2 w_2}} + sin(e^{w_1} + e^{2 w_2})   (12)
f_2(w_1, w_2) = w_1 w_2 + σ(w_1)   (13)
(a) Draw the computation graph. Compute the value of f at w⃗ = (1, 2).
(b) At this w⃗, compute the Jacobian ∂f⃗/∂w⃗ using numerical differentiation (using Δw = 0.01).
(c) At this w⃗, compute the Jacobian using forward-mode auto-differentiation.
(d) At this w⃗, compute the Jacobian using backward-mode auto-differentiation.
(e) Don't you love that software exists to do this for us?

One important property of a feed-forward network that we have discussed in class is that it must be a directed acyclic graph (DAG). Recall that a DAG is a directed graph that contains no directed cycles. We will study some of its properties in this question. Define G = (V, E), in which V is the set of all nodes {v_1, v_2, ..., v_i, ..., v_n} and E is the set of edges E = {e_{i,j} = (v_i, v_j) | v_i, v_j ∈ V}. A topological order of a directed graph G = (V, E) is an ordering of its nodes as {v_1, v_2, ..., v_i, ..., v_n} so that for every edge (v_i, v_j) we have i < j. Several lemmas can be inferred from the definition of a DAG. One lemma is: if G is a DAG, then G has a node with no incoming edges. Given the above lemma, prove the following two lemmas:
6. [2 points] If the graph G is a DAG, then G has a topological ordering.
7. [2 points] If the graph G has a topological order, then G is a DAG.

4 Implement and train a network on CIFAR-10
Setup Instructions: Before attempting this question, look at the setup instructions here.
8. (Up to 29 points) Now, we will learn how to implement a softmax classifier, vanilla neural networks (or multi-layer perceptrons), and ConvNets. You will begin by writing the forward and backward passes for different types of layers (including convolution and pooling), and then go on to train a shallow ConvNet on the CIFAR-10 dataset in Python. Next, you will learn to use PyTorch, a popular open-source deep learning framework, and use it to replicate the experiments from before. Follow the instructions provided here.
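For the worked example f(x, y) = (x + y)(y + 1) used to introduce computation graphs above, the chain-rule bookkeeping can be checked mechanically. The snippet below does the backward pass by hand and compares it with central-difference numerical differentiation, the same recipe part (b) asks you to apply to the harder f_1, f_2.

def f(x, y):
    a = x + y          # intermediate node a
    b = y + 1          # intermediate node b
    return a * b       # f = a * b

def backward(x, y):
    """Hand-written backward pass over the computation graph of f(x, y) = (x + y)(y + 1)."""
    a, b = x + y, y + 1
    df_da, df_db = b, a                 # local derivatives of f = a * b
    df_dx = df_da * 1.0                 # a = x + y  ->  da/dx = 1
    df_dy = df_da * 1.0 + df_db * 1.0   # both a and b depend on y
    return df_dx, df_dy

def numerical_grad(x, y, eps=0.01):
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

print(backward(2.0, 3.0))        # chain-rule gradients
print(numerical_grad(2.0, 3.0))  # should agree closely with the line above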
CSE 6242 / CX 4242: Data and Visual Analytics
HW 1: End-to-end analysis of TMDb data, SQLite, D3 Warmup, OpenRefine, Flask

Vast amounts of digital data are generated each day, but raw data is often not immediately "usable." Instead, we are interested in the information content of the data: what patterns are captured? This assignment covers useful tools for acquiring, cleaning, storing, and visualizing datasets. In questions 1 & 2, we'll perform a simple end-to-end analysis using data from The Movie Database (TMDb). We will collect movie data via API, store the data in csv files, and analyze the data using SQL queries. For Q3, we will complete a D3 warmup to prepare our students for visualization questions in HW2. Q4 & Q5 provide an opportunity to explore other industry tools used to acquire, store, and clean datasets. The maximum possible score for this homework is 100 points.

Contents
Download the HW1 Skeleton before you begin.
Homework Overview
Important Notes
Submission Notes
Do I need to use the specific version of the software listed?
Q1 [40 points] Collect data from TMDb to build a co-actor network
Q2 [35 points] SQLite
Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species
Q4 [5 points] OpenRefine
Q5 [5 points] Introduction to Python Flask

Important Notes
A. Submit your work by the due date on the course schedule.
a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking.
b. Before the grace period expires, you may resubmit as many times as you need.
c. TA assistance is not guaranteed during the grace period.
d. Submissions during the grace period will display as "late" but will not incur a penalty.
e. We will not accept any submissions executed after the grace period ends.
B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion.
C. You may discuss ideas with other students at the "whiteboard" level (e.g., how cross-validation works, use HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student's own answers.
D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute's Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.

Submission Notes
A. All questions are graded on the Gradescope platform, accessible through Canvas.
B. We will not accept submissions anywhere else outside of Gradescope.
C. Submit all required files as specified in each question. Make sure they are named correctly.
D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive.
E. You must not use Gradescope as the primary way to test your code.
It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 1 Q1 [40 points] Collect data from TMDb to build a co-actor network Leveraging the power of APIs for data acquisition, you will build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, you will create a graph showing the relationships between actors based on their highly rated movies. This will not only highlight the practical application of APIs in collecting rich datasets, but also introduce the importance of graphs in understanding and visualizing the real-world dataset. Technology • Python 3.10.x • TMDb API version 3 Allowed Libraries The Python Standard Library and Requests only. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q1.py: The completed Python file • nodes.csv: The csv file containing nodes • edges.csv: The csv file containing edges Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval. Tasks and point breakdown 1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py. a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In/ out-degrees are not defined for undirected graphs. 2. [10 pts] Implementation of the TMDbAPIUtils class according to instructions in Q1.py. Use version 3 of the TMDb API to download data about actors and their co-actors. To use the API: a. Create a TMDb account and follow the instructions on this document to obtain an API key. b. Be sure to use the key, not the token. This is the shorter of the two. c. 
Refer to the TMDB API Documentation as you work on this question. 3. [20 pts] Producing correct nodes.csv and edges.csv. a. If an actor’s name has comma characters (“,”), remove those characters before writing that name into the CSV files. 4 Version 1 Q2 [35 points] SQLite SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file that does not need to be parsed explicitly (unlike CSV files, which must be parsed). You can find instructions to install SQLite here. In this question, you will construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions. You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite. Technology • SQLite release 3.37.2 • Python 3.10.x Allowed Libraries Do not modify import statements. Everything you need to complete this question has been imported for you. Do not use other libraries for this question. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q2.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence. IMPORTANT NOTES: • If the final output asks for a decimal column, format it to two places using printf(). Do NOT use the ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that column, be sure you sort it using the actual decimal value and not the string returned by printf. • A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False. • In this question, you must only use INNER JOIN when performing a join between two tables, except for part g. Other types of joins may result in incorrect results. Tasks and point breakdown 1. [9 points] Create tables and import data. a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types: i. movies 1. id (integer) 2. title (text) 3. score (real) ii. movie_cast 1. movie_id (integer) 2. cast_id (integer) 3. cast_name (text) 4. birthday (text) 5. popularity (real) b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table i. Write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Make sure you use paths relative to the Q2 directory. c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning. 5 Version 1 i. cast_bio 1. cast_id (integer) 2. cast_name (text) 3. birthday (text) 4. popularity (real) 2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the speed improvement may be negligible for this small database, it is significant for larger databases. a. 
movie_index for the id column in movies table b. cast_index for the cast_id column in movie_cast table c. cast_bio_index for the cast_id column in cast_bio table 3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage. a. Output format and example value: 7.70 4. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order. a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2 5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. a. Output format and example values (movie_title,score,cast_count): Star Wars: Holiday Special,75.01,12 Games,58.49,33 6. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically. a. Exclude movies with score < 25 before calculating average_score. b. Include only cast members who have appeared in three or more movies with score >= 25. i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00 7. [2 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND the average score of these movies is >= 40. The view should have the format: good_collaboration( cast_member_id1, cast_member_id2, movie_count, average_movie_score) For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that 6 Version 1 you may have used for testing. Optional Reading: Why create views? 8. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2. a. Order your output by collaboration_score in descending order, then by cast_name alphabetically. b. Output format and example values(cast_id,cast_name,collaboration_score): 2,Mark Hamil,99.32 1920,Winoa Ryder,88.32 9. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). a. [1 point] Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text) NOTE: Create the table using fts3 or fts4 only. 
Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries. NOTE: If you have issues that fts is not enabled, try the following steps • Go to sqlite3 downloads page: https://www.sqlite.org/download.html • Download the dll file for your system • Navigate to your Python packages folder, e.g., C:Users… …Anaconda3pkgssqlite-3.29.0- he774522_0Librarybin • Drop the downloaded .dll file in the bin. • In your IDE, import sqlite3 again, fts should be enabled. b. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings. i. Example: Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’ Disallowed: ‘gunfight’, ‘fighting’, etc. ii. Output format and example value: 12 c. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. i. Example: Allowed: ‘In Space there was a program’, ‘In this space program’ Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’ ii. Output format and example value: 6 7 Version 1 Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species In this question, you will utilize a dataset provided by TRAFFIC, an NGO working to ensure the global trade of wildlife is both legal and sustainable. TRAFFIC provides data through their interactive Wildlife Trade Portal, some of which we have already downloaded and pre-processed for you to utilize in Q3. Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse. Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., [email protected]). This reading provides an important foundation you will need for Homework 2. The question and autograder have been developed and tested for D3 version 5 (v5), while the book covers v4. What you learn from the book is transferable to v5, as v5 introduced few breaking changes. We also suggest briefly reviewing chapters 1-3 for background information on web development. TRAFFIC International (2024) Wildlife Trade Portal. Available at www.wildlifetradeportal.org. Technology • D3 Version 5 (included in the lib folder) • Chrome 97.0 (or newer): the browser for grading your code • Python HTTP server (for local testing) Allowed Libraries D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. Deliverables • Q3.html: Modified file containing all html, javascript, and any css code required to produce the bar plot. Do not include the D3 libraries or q3.csv dataset. IMPORTANT NOTES: • Setup an HTTP server to run your D3 visualizations as discussed in the D3 lecture (OMS students: watch lecture video. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder. • We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work. 
• All d3*.js files are provided in the lib folder and referenced using relative paths in your html file. For example, since the file “Q3/Q3.html” uses d3, its header contains:. It is incorrect to use an absolute path such as:. The 3 files that are referenced are: a. lib/d3/d3.min.js b. lib/d3-dsv/d3-dsv.min.js c. lib/d3-fetch/d3-fetch.min.js • In your html / js code, use a relative path to read the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path must be “q3.csv” and NOT an absolute path such as “C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade, and you will lose points. As file paths are case-sensitive, ensure you correctly provide the relative path. • Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). Handle any data conversions that might be needed, e.g., strings that need to be converted to integer. See https://github.com/d3/d3-fetch#dsv. • VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout. Tasks and point breakdown Q3.html: When run in a browser, should display a horizontal bar plot with the following specifications: 8 Version 1 1. [3.5 points] The bar plot must display one bar for each of the five most trafficked species by count. Each bar’s length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the ‘count’ column in our dataset. 2. [1 point] The bars must have the same fixed thickness, and there must be some space between the bars, so they do not overlap. 3. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. They must not be hard-coded. The x-axis must be a element having the id: “x_axis” and the y-axis must be a element having the id: “y_axis”. 4. [2 points] Set x-axis label to ‘Count’ and y-axis label to ‘Species’. The x-axis label must be a element having the id: “x_axis_label” and the y-axis label must be a element having the id: “y_axis_label”. 5. [2 points] Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every 500 interval. The X-axis must be displayed below the plot. 6. [2 points] Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least on ‘Count’ and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot. 7. [1 point] Set the HTML title tag and display a title for the plot. Those two titles are independent of each other and need to be set separately. Set the HTML title tag (i.e.,). Position the title “Wildlife Trafficking Incidents per Species (2015 to 2023)” above the bar plot. The title must be a element having the id: “title”. 8. [0.25 points] Add your GT username (usually includes a mix of letters and numbers) to the area beneath the bottom-right of the plot. The GT username must be a element having the id: “credit” 9. [0.25 points] Fill each bar with a unique color. We recommend using a colorblind-safe pallete. NOTE: Gradescope will render your plot using Chrome and present you with a Dropbox link to view the screenshot of your plot as the autograder sees it. 
This visual feedback helps you adjust and identify errors, e.g., a blank plot indicates a serious error. Your design does not need to replicate the solution plot. However, the autograder requires the following DOM structure (including using correct IDs for elements) and sizing attributes to know how your chart is built. 9 Version 1 plot | width: 900 | height: 370 | +– containing Q3.a plot elements | +– containing bars | +– x-axis | | | +– (x-axis elements) | +– x-axis label | +– y-axis | | | +– (y-axis elements) | +– y-axis label | +– GTUsername | +– chart title 10 Version 1 Q4 [5 points] OpenRefine OpenRefine is a powerful tool for working with messy data, allowing users to clean and transform data efficiently. Use OpenRefine in this question to clean data from Mercari. Construct GREL queries to filter the entries in this dataset. OpenRefine is a Java application that requires Java JRE to run. However, OpenRefine v.3.6.2 comes with a compatible Java version embedded with the installer. So, there is no need to install Java separately when working with this version. Go through the main features on OpenRefine’s homepage. Then, download and install OpenRefine 3.6.2. The link to release 3.6.2 is https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2 Technology • OpenRefine 3.6.2 Deliverables • properties_clean.csv: Export the final table as a csv file. • changes.json: Submit a list of changes made to file in json format. Go to ‘Undo/Redo’ Tab → ‘Extract’ → ‘Export’. This downloads ‘history.json’ . Rename it to ‘changes.json’. • Q4Observations.txt: A text file with answers to parts b.i, b.ii, b.iii, b.iv, b.v, b.vi. Provide each answer in a new line in the output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line. Tasks and point breakdown 1. Import Dataset a. Run OpenRefine and point your browser at https://127.0.0.1:3333. b. We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset provided as “properties.csv”. c. Choose “Create Project” → This Computer → properties.csv. Click “Next”. d. You will now see a preview of the data. Click “Create Project” at the upper right corner. 2. [5 points] Clean/Refine the Data a. [0.5 point] Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows. Output format and sample values: i.rows: 500 NOTE: OpenRefine maintains a log of all changes. You can undo changes by the “Undo/Redo” button at the upper left corner. You must follow all the steps in order and submit the final cleaned data file properties_clean.csv. The changes made by this step need to be present in the final submission. If they are not done at the beginning, the final number of rows can be incorrect and raise errors by the autograder. b. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. 
Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (i.e., in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column. Output format and sample values: ii.columns: 10 11 Version 1 NOTE: While multiple methods can split data, ensure new columns aren’t empty. Validate by sorting and checking for null values after using our suggested method in step b. c. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster by using (Edit Cells → Cluster and Edit …) this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the smallest number of clusters under the “Key Collision” method. Click ‘Select All’ and ‘Merge Selected & Close’. Provide the name of the keying function and the number of clusters produced. Output format and sample values: iii.function: fingerprint, 200 NOTE: Use the default Ngram size when testing Ngram-fingerprint. d. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells → Transform). Provide the expression used. Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”) NOTE: “Unknown” is case and space sensitive (“Unknown” is different from “unknown” and “Unknown “.) e. [0.5 point] Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this. Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”) f. [1.5 points] Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase in the GREL expression before you search for the terms. Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”) 12 Version 1 Q5 [5 points] Introduction to Python Flask Flask is a lightweight web application framework written in Python that provides you with tools, libraries, and technologies to build a web application quickly and scale it up as needed. In this question, you will build a web application that displays a table of TMDb data on a single-page website using Flask. You will modify the given file: wrangling_scripts/Q5.py Technology Python 3.10.x Flask Allowed Libraries Python standard libraries Libraries already imported in Q5.py Deliverables Q5.py: Completed Python file with your changes Tasks and point breakdown 1. username() – Update the username() method inside Q5.py by including your GT username. 2. Install Flask on your machine by running $ pip install Flask a. You can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped. 3. To run the code, navigate to the Q5 folder in your terminal/command prompt and execute the following command: python run.py. After running the command, go to http://127.0.0.1:3001/ on your browser. This will open up index.html, showing a table in which the rows returned by data_wrangling() are displayed. 4. You must solve the following two sub-questions: a. 
CSE 6242 / CX 4242: Data and Visual Analytics
HW 2: Tableau, D3 Graphs, and Visualization
"Visualization gives you answers to questions you didn't know you have" – Ben Shneiderman
Download the HW2 Skeleton before you begin.
Data visualization is an integral part of exploratory analysis and communicating key insights. This homework focuses on exploring and creating data visualizations using two of the most popular tools in the field: Tableau and D3.js. All 5 questions use data on the same topic to highlight the uses and strengths of different types of visualizations. The data comes from BoardGameGeek and includes games' ratings, popularity, and metadata. Below are some terms you will often see in the questions:
• Rating – a value from 0 to 10 given to each game. BoardGameGeek calculates a game's overall rating in different ways, including Average and Bayes, so make sure you are using the correct rating called for in a question. A higher rating is better than a lower rating.
• Rank – the overall rank of a boardgame from 1 to n, with ranks closer to 1 being better and n being the total number of games. The rank may be for all games or for a subgroup of games such as abstract games or family games.
The maximum possible score for this homework is 100 points. Students have the option to complete any 90 points' worth of work to receive 100% (equivalent to 15 course total grade points) for this assignment. They can earn more than 100% if they submit additional work. For example, a student scoring 100 points will receive 111% for the assignment (equivalent to 16.67 course total grade points, as shown on Canvas).
Download the HW2 Skeleton before you begin ………………………………………… 1
Homework Overview ………………………………………………………………………… 1
Important Notes ……………………………………………………………………………… 2
Submission Notes …………………………………………………………………………… 2
Do I need to use the specific version of the software listed? ………………………… 2
Q1 [25 points] Designing a good table. Visualizing data with Tableau. ……………… 3
Important Points about Developing with D3 in Questions 2–5 ………………………… 7
Q2 [15 points] Force-directed graph layout ………………………………………………… 8
Q3 [15 points] Line Charts …………………………………………………………………… 10
Q4 [20 points] Interactive Visualization ……………………………………………………… 14
Q5 [25 points] Choropleth Map of Board Game Ratings ………………………………… 18
Important Notes
A. Submit your work by the due date on the course schedule.
a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking.
b. Before the grace period expires, you may resubmit as many times as needed.
c. TA assistance is not guaranteed during the grace period.
d.
Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. a. Question 1 will be manually graded after the final HW due date and Grace Period. b. Questions 2-5 are auto graded at the time of submission. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 0 Q1 [25 points] Designing a good table. Visualizing data with Tableau. Goal Design a table, a grouped bar chart, and a stacked bar chart with filters in Tableau. Technology Tableau Desktop Deliverables Gradescope: After selecting HW2 – Q1, click Submit Images. You will be taken to a list of questions for your assignment. 
Click Select Images and submit the following four PNG images under the corresponding questions: ● table.png: Image/screenshot of the table in Q1.1 ● grouped_barchart.png: Image of the chart in Q1.2 ● stacked_barchart_1.png: Image of the chart in Q1.3 after filtering data for Max.Players = 2 ● stacked_barchart_2.png: Image of the chart in Q1.3 after filtering data for Max.Players = 4 a Q1 will be manually graded after the grace period. Setting Up Tableau Install and activate Tableau Desktop by following “HW2 Instructions” on Canvas. The product activation key is for your use in this course only. Do not share the key with anyone. If you already have Tableau Desktop installed on your machine, you may use this key to reactivate it. a If you do not have access to a Mac or Windows machine, use the 14-day trial version of Tableau Online: 1. Visit https://www.tableau.com/trial/tableau-online 2. Enter your information (name, email, GT details, etc.) 3. You will then receive an email to access your Tableau Online site 4. Go to your site and create a workbook a If neither of the above methods work, use Tableau for Students. Follow the link and select “Get Tableau For Free”. You should be able to receive an activation key which offers you a one-year use of Tableau Desktop at no cost by providing a valid Georgia Tech email. Connecting to Data 1. It is optional to use Tableau for Q1.1. Otherwise, complete all parts using a single Tableau workbook. 2. Q1 will require connecting Tableau to two different data sources. You can connect to multiple data sources within one workbook by following the directions here. 3. For Q1.1 and Q1.2: a. Open Tableau and connect to a data source. Choose To a File – Text file. Select the popular_board_game.csv file from the skeleton. b. Click on the graph area at the bottom section next to “Data Source” to create worksheets. 4. For Q1.3: a. You will need a data.world account to access the data for Q1.3. Add a new data source by clicking on Data – New Data Source. b. When connecting to a data source, choose To a Server – Web Data Connector. c. Enter this URL to connect to the data.world data set on board games. You may be prompted to log in to data-world and authorize Tableau. If you haven’t used data.world before, you will be required to create an account by clicking “Join Now”. Do not edit the provided SQL query. a NOTE: If you cannot connect to data-world, you can use the provided csv files for Q1 in the skeleton. The provided csv files are identical to those hosted online and can be loaded directly into Tableau. a d. Click the graph area at the bottom section to create another worksheet, and Tableau will automatically create a data extract. 4 Version 0 Table and Chart Design 1. [5 points] Good table design. Visualize the data contained in popular_board_game.csv as a data table (known as a text table in Tableau). In this part (Q1.1), you can use any tool (e.g., Excel, HTML, Pandas, Tableau) to create the table. We are interested in grouping popular games into “support solo” (min player = 1) and “not support solo” (min player > 1). Your table should clearly communicate information about these two groups simultaneously. For each group (Solo Supported, Solo Not Supported), show: a a. Total number of games in each category (fighting, economic, …) b. In each category, the game with the highest number of ratings. If more than one game has the same (highest) number of ratings, pick the game you prefer. NOTE: Level of Detail expressions may be useful if you use Tableau. c. 
Average rating of games in each category (use simple average), rounded to 2 decimal places. d. Average playtime of games in each category, rounded to 2 decimal places. e. In the bottom left corner below your table, include your GT username (In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. If you use Tableau, refer to the tutorial here). f. Save the table as table.png. (If you use Tableau, go to Worksheet/Dashboard Export Image). NOTE: Do not take screenshots in Tableau since your image must have high resolution. You can take a screenshot If you use HTML, Pandas, etc. a Your learning goal here is to practice good table design, which is not strongly dependent on the tool that you use. Thus, we do not require that you use Tableau in this part. You may decide the most meaningful column names, the number of columns, and the column order. You are not limited to only the techniques described in the lecture. For OMS students, the lecture video on this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus students, review lecture slides 42 and 43. 2. [10 points] Grouped bar chart. Visualize popular_board_game.csv as a grouped bar chart in Tableau. Your chart should display game category (e.g., fighting, economic,…) along the horizontal axis and game count along the vertical axis. Show game playtime (e.g.,
CSE 6242 / CX 4242: Data and Visual Analytics HW 1: End-to-end analysis of TMDb data, SQLite, D3 Warmup, OpenRefine, FlaskVast amounts of digital data are generated each day, but raw data is often not immediately “usable”. Instead, we are interested in the information content of the data such as what patterns are captured? This assignment covers useful tools for acquiring, cleaning, storing, and visualizing datasets. In questions 1 & 2, we’ll perform a simple end-to-end analysis using data from The Movie Database (TMDb).We will collect movie data via API, store the data in csv files, and analyze data using SQL queries. For Q3, we will complete a D3 warmup to prepare our students for visualization questions in HW2. Q4 & 5 will provide an opportunity to explore other industry tools used to acquire, store, and clean datasets. The maximum possible score for this homework is 100 points. Contents Download the HW1 Skeleton before you begin. ………………………………………………………………………… 1Homework Overview…………………………………………………………………………………………………………………. 1 Important Notes ………………………………………………………………………………………………………………………. 2 Submission Notes…………………………………………………………………………………………………………………….. 2 Do I need to use the specific version of the software listed?……………………………………………………………. 2 Q1 [40 points] Collect data from TMDb to build a co-actor network…………………………………………………… 3 Q2 [35 points] SQLite ……………………………………………………………………………………………………………….. 4 Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species…………………………………………….. 7 Q4 [5 points] OpenRefine …………………………………………………………………………………………………………10 Q5 [5 points] Introduction to Python Flask …………………………………………………………………………………..12 2 Version 1Important Notes A. Submit your work by the due date on the course schedule. a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking. b. Before the grace period expires, you may resubmit as many times as you need. c. TA assistance is not guaranteed during the grace period. d. Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. 
It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 1 Q1 [40 points] Collect data from TMDb to build a co-actor network Leveraging the power of APIs for data acquisition, you will build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, you will create a graph showing the relationships between actors based on their highly rated movies. This will not only highlight the practical application of APIs in collecting rich datasets, but also introduce the importance of graphs in understanding and visualizing the real-world dataset. Technology • Python 3.10.x • TMDb API version 3 Allowed Libraries The Python Standard Library and Requests only. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q1.py: The completed Python file • nodes.csv: The csv file containing nodes • edges.csv: The csv file containing edges Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval. Tasks and point breakdown 1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py. a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In/ out-degrees are not defined for undirected graphs. 2. [10 pts] Implementation of the TMDbAPIUtils class according to instructions in Q1.py. Use version 3 of the TMDb API to download data about actors and their co-actors. To use the API: a. Create a TMDb account and follow the instructions on this document to obtain an API key. b. Be sure to use the key, not the token. This is the shorter of the two. c. 
Refer to the TMDB API Documentation as you work on this question. 3. [20 pts] Producing correct nodes.csv and edges.csv. a. If an actor’s name has comma characters (“,”), remove those characters before writing that name into the CSV files. 4 Version 1 Q2 [35 points] SQLite SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file that does not need to be parsed explicitly (unlike CSV files, which must be parsed). You can find instructions to install SQLite here. In this question, you will construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions. You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite. Technology • SQLite release 3.37.2 • Python 3.10.x Allowed Libraries Do not modify import statements. Everything you need to complete this question has been imported for you. Do not use other libraries for this question. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q2.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence. IMPORTANT NOTES: • If the final output asks for a decimal column, format it to two places using printf(). Do NOT use the ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that column, be sure you sort it using the actual decimal value and not the string returned by printf. • A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False. • In this question, you must only use INNER JOIN when performing a join between two tables, except for part g. Other types of joins may result in incorrect results. Tasks and point breakdown 1. [9 points] Create tables and import data. a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types: i. movies 1. id (integer) 2. title (text) 3. score (real) ii. movie_cast 1. movie_id (integer) 2. cast_id (integer) 3. cast_name (text) 4. birthday (text) 5. popularity (real) b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table i. Write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Make sure you use paths relative to the Q2 directory. c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning. 5 Version 1 i. cast_bio 1. cast_id (integer) 2. cast_name (text) 3. birthday (text) 4. popularity (real) 2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the speed improvement may be negligible for this small database, it is significant for larger databases. a. 
movie_index for the id column in movies table b. cast_index for the cast_id column in movie_cast table c. cast_bio_index for the cast_id column in cast_bio table 3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage. a. Output format and example value: 7.70 4. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order. a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2 5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. a. Output format and example values (movie_title,score,cast_count): Star Wars: Holiday Special,75.01,12 Games,58.49,33 6. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically. a. Exclude movies with score < 25 before calculating average_score. b. Include only cast members who have appeared in three or more movies with score >= 25. i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00 7. [2 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND the average score of these movies is >= 40. The view should have the format: good_collaboration( cast_member_id1, cast_member_id2, movie_count, average_movie_score) For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that 6 Version 1 you may have used for testing. Optional Reading: Why create views? 8. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2. a. Order your output by collaboration_score in descending order, then by cast_name alphabetically. b. Output format and example values(cast_id,cast_name,collaboration_score): 2,Mark Hamil,99.32 1920,Winoa Ryder,88.32 9. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). a. [1 point] Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text) NOTE: Create the table using fts3 or fts4 only. 
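For orientation, here is a small, hedged sketch (not the required solution) of how an FTS4 table behaves once populated, and of the kind of MATCH queries parts b and c below ask about. It uses Python's built-in sqlite3 module with made-up sample rows; in the assignment, movie_overview must be populated from movie_overview.csv inside Q2.py using only the imports already provided there, and your SQLite build must have FTS compiled in (most Python distributions do).

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway database just for this illustration
cur = con.cursor()

# FTS4 virtual table; FTS columns are untyped, so only the column names are given.
cur.execute("CREATE VIRTUAL TABLE movie_overview USING fts4(id, overview)")

sample_rows = [  # made-up overviews, only to demonstrate MATCH behavior
    (1, "Two rivals fight for the championship."),
    (2, "A gunfight erupts at dawn."),
    (3, "In this space exploration program, astronauts train for years."),
]
cur.executemany("INSERT INTO movie_overview (id, overview) VALUES (?, ?)", sample_rows)

# Full-token match: counts 'fight' and 'fight.' but not 'gunfight' or 'fighting'.
print(cur.execute(
    "SELECT COUNT(*) FROM movie_overview WHERE overview MATCH 'fight'").fetchone()[0])

# 'space' and 'program' within at most 5 tokens of each other.
print(cur.execute(
    "SELECT COUNT(*) FROM movie_overview WHERE overview MATCH 'space NEAR/5 program'").fetchone()[0])

con.close()
```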
Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries.
NOTE: If you run into issues where FTS is not enabled, try the following steps:
• Go to the sqlite3 downloads page: https://www.sqlite.org/download.html
• Download the dll file for your system
• Navigate to your Python packages folder, e.g., C:\Users\…\Anaconda3\pkgs\sqlite-3.29.0-he774522_0\Library\bin
• Drop the downloaded .dll file into that bin folder.
• In your IDE, import sqlite3 again; FTS should now be enabled.
b. [1 point] Count the number of movies whose overview field contains the word 'fight'. Matches are not case sensitive. Match full words, not word parts/sub-strings. i. Example: Allowed: 'FIGHT', 'Fight', 'fight', 'fight.' Disallowed: 'gunfight', 'fighting', etc. ii. Output format and example value: 12
c. [2 points] Count the number of movies that contain the terms 'space' and 'program' in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As in part b, match full words, not word parts/sub-strings. i. Example: Allowed: 'In Space there was a program', 'In this space program' Disallowed: 'In space you are not subjected to the laws of gravity. A program.' ii. Output format and example value: 6
Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species
In this question, you will utilize a dataset provided by TRAFFIC, an NGO working to ensure the global trade of wildlife is both legal and sustainable. TRAFFIC provides data through their interactive Wildlife Trade Portal, some of which we have already downloaded and pre-processed for you to utilize in Q3. Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse. Read chapters 4-8 of Scott Murray's Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account). This reading provides an important foundation you will need for Homework 2. The question and autograder have been developed and tested for D3 version 5 (v5), while the book covers v4. What you learn from the book is transferable to v5, as v5 introduced few breaking changes. We also suggest briefly reviewing chapters 1-3 for background information on web development. TRAFFIC International (2024) Wildlife Trade Portal. Available at www.wildlifetradeportal.org.
Technology
• D3 Version 5 (included in the lib folder)
• Chrome 97.0 (or newer): the browser for grading your code
• Python HTTP server (for local testing)
Allowed Libraries
The D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided.
Deliverables
• Q3.html: Modified file containing all html, javascript, and any css code required to produce the bar plot. Do not include the D3 libraries or the q3.csv dataset.
IMPORTANT NOTES:
• Set up an HTTP server to run your D3 visualizations as discussed in the D3 lecture (OMS students: watch the lecture video. Campus students: see the lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder.
• We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work.
• All d3*.js files are provided in the lib folder and referenced using relative paths in your html file. For example, since the file "Q3/Q3.html" uses d3, its header references the d3 scripts with relative paths; it is incorrect to reference them with absolute paths (e.g., paths that point to a location on your own computer or to an external website). The 3 files that are referenced are:
a. lib/d3/d3.min.js
b. lib/d3-dsv/d3-dsv.min.js
c. lib/d3-fetch/d3-fetch.min.js
• In your html / js code, use a relative path to read the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path must be "q3.csv" and NOT an absolute path such as "C:/Users/polo/HW1-skeleton/Q3/q3.csv". Absolute paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade, and you will lose points. As file paths are case-sensitive, ensure you correctly provide the relative path.
• Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). Handle any data conversions that might be needed, e.g., strings that need to be converted to integers. See https://github.com/d3/d3-fetch#dsv.
• VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout.
Tasks and point breakdown
Q3.html: When run in a browser, it should display a horizontal bar plot with the following specifications:
1. [3.5 points] The bar plot must display one bar for each of the five most trafficked species by count. Each bar's length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the 'count' column in our dataset.
2. [1 point] The bars must have the same fixed thickness, and there must be some space between the bars, so they do not overlap.
3. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. They must not be hard-coded. The x-axis must be an element having the id: "x_axis" and the y-axis must be an element having the id: "y_axis".
4. [2 points] Set the x-axis label to 'Count' and the y-axis label to 'Species'. The x-axis label must be an element having the id: "x_axis_label" and the y-axis label must be an element having the id: "y_axis_label".
5. [2 points] Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every 500-unit interval. The X-axis must be displayed below the plot.
6. [2 points] Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least on 'Count' and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot.
7. [1 point] Set the HTML title tag and display a title for the plot. Those two titles are independent of each other and need to be set separately. Position the plot title "Wildlife Trafficking Incidents per Species (2015 to 2023)" above the bar plot. The plot title must be an element having the id: "title".
8. [0.25 points] Add your GT username (usually includes a mix of letters and numbers) to the area beneath the bottom-right of the plot. The GT username must be an element having the id: "credit".
9. [0.25 points] Fill each bar with a unique color. We recommend using a colorblind-safe palette.
NOTE: Gradescope will render your plot using Chrome and present you with a Dropbox link to view the screenshot of your plot as the autograder sees it.
This visual feedback helps you adjust and identify errors, e.g., a blank plot indicates a serious error. Your design does not need to replicate the solution plot. However, the autograder requires the following DOM structure (including using correct IDs for elements) and sizing attributes to know how your chart is built.

plot
|  width: 900
|  height: 370
|
+-- containing Q3.a plot elements
    |
    +-- containing bars
    |
    +-- x-axis
    |   |
    |   +-- (x-axis elements)
    |
    +-- x-axis label
    |
    +-- y-axis
    |   |
    |   +-- (y-axis elements)
    |
    +-- y-axis label
    |
    +-- GTUsername
    |
    +-- chart title

Q4 [5 points] OpenRefine
OpenRefine is a powerful tool for working with messy data, allowing users to clean and transform data efficiently. Use OpenRefine in this question to clean data from Mercari. Construct GREL queries to filter the entries in this dataset. OpenRefine is a Java application that requires a Java JRE to run. However, OpenRefine v.3.6.2 comes with a compatible Java version embedded in the installer, so there is no need to install Java separately when working with this version. Go through the main features on OpenRefine's homepage. Then, download and install OpenRefine 3.6.2. The link to release 3.6.2 is https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2
Technology
• OpenRefine 3.6.2
Deliverables
• properties_clean.csv: Export the final table as a csv file.
• changes.json: Submit a list of changes made to the file in json format. Go to the 'Undo/Redo' tab → 'Extract' → 'Export'. This downloads 'history.json'. Rename it to 'changes.json'.
• Q4Observations.txt: A text file with answers to parts b.i, b.ii, b.iii, b.iv, b.v, b.vi. Provide each answer on a new line in the output format specified. Your file's final formatting should result in a .txt file that has each answer on a new line followed by one blank line.
Tasks and point breakdown
1. Import Dataset
a. Run OpenRefine and point your browser at http://127.0.0.1:3333.
b. We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset provided as "properties.csv".
c. Choose "Create Project" → This Computer → properties.csv. Click "Next".
d. You will now see a preview of the data. Click "Create Project" at the upper right corner.
2. [5 points] Clean/Refine the Data
a. [0.5 point] Select the category_name column and choose 'Facet by Blank' (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows. Output format and sample values: i.rows: 500
NOTE: OpenRefine maintains a log of all changes. You can undo changes with the "Undo/Redo" button at the upper left corner. You must follow all the steps in order and submit the final cleaned data file properties_clean.csv. The changes made by this step need to be present in the final submission. If they are not done at the beginning, the final number of rows can be incorrect and the autograder may raise errors.
b. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with "Kids/Toys/Dolls & Accessories" in the category_name column would be split across the newly created columns as "Kids", "Toys" and "Dolls & Accessories".
Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (i.e., in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column. Output format and sample values: ii.columns: 10 11 Version 1 NOTE: While multiple methods can split data, ensure new columns aren’t empty. Validate by sorting and checking for null values after using our suggested method in step b. c. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster by using (Edit Cells → Cluster and Edit …) this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the smallest number of clusters under the “Key Collision” method. Click ‘Select All’ and ‘Merge Selected & Close’. Provide the name of the keying function and the number of clusters produced. Output format and sample values: iii.function: fingerprint, 200 NOTE: Use the default Ngram size when testing Ngram-fingerprint. d. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells → Transform). Provide the expression used. Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”) NOTE: “Unknown” is case and space sensitive (“Unknown” is different from “unknown” and “Unknown “.) e. [0.5 point] Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this. Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”) f. [1.5 points] Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase in the GREL expression before you search for the terms. Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”) 12 Version 1 Q5 [5 points] Introduction to Python Flask Flask is a lightweight web application framework written in Python that provides you with tools, libraries, and technologies to build a web application quickly and scale it up as needed. In this question, you will build a web application that displays a table of TMDb data on a single-page website using Flask. You will modify the given file: wrangling_scripts/Q5.py Technology Python 3.10.x Flask Allowed Libraries Python standard libraries Libraries already imported in Q5.py Deliverables Q5.py: Completed Python file with your changes Tasks and point breakdown 1. username() – Update the username() method inside Q5.py by including your GT username. 2. Install Flask on your machine by running $ pip install Flask a. You can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped. 3. To run the code, navigate to the Q5 folder in your terminal/command prompt and execute the following command: python run.py. After running the command, go to http://127.0.0.1:3001/ on your browser. This will open up index.html, showing a table in which the rows returned by data_wrangling() are displayed. 4. You must solve the following two sub-questions: a. 
[2 points] Read and store the first 100 rows in a table using the data_wrangling() method. NOTE: The skeleton code, by default, reads all the rows from movies.csv. You must add the required code to ensure that you are reading only the first 100 data rows. The skeleton code already handles reading the table header for you. b. [3 points]: Sort this table in descending order of the values, i.e., with larger values at the top and smaller values at the bottom of the table in the last (3rd) column. Note that this column needs to be returned as a string for the autograder, but sorting may require float casting.
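To make the float-casting point concrete, here is a small, hedged sketch of sorting row lists the way part b describes. It is illustration only, not the required data_wrangling() implementation; the real row layout and return types are defined by the skeleton in wrangling_scripts/Q5.py and movies.csv.

```python
# Toy rows shaped like (title, year, value-as-string); the real columns come from movies.csv.
rows = [
    ("Movie A", "2001", "7.5"),
    ("Movie B", "1999", "10.2"),
    ("Movie C", "2010", "9.05"),
]

# Keep only the first 100 data rows (harmless if there are fewer).
rows = rows[:100]

# Sort by the last (3rd) column in descending order. Cast to float for the comparison,
# because sorting the strings directly would rank "9.05" above "10.2".
rows.sort(key=lambda row: float(row[-1]), reverse=True)

print(rows)  # Movie B first (10.2), then Movie C (9.05), then Movie A (7.5)
```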
CSE 6242 / CX 4242: Data and Visual Analytics HW 3: Spark, Docker, DataBricks, AWS and GCPDownload the HW3 Skeleton, Q1 Data, Q2 Data, and Q4 Data before you begin. Also, create an AWS Academy account as outlined in Step 1 of the AWS Setup GuideModern-day datasets are large. For example, the NASA Terra and Aqua satellites each produces over 300GB of satellite imagery daily. These datasets are too large for typical computer hard drives and requires advanced technologies for processing. In this assignment, you will work with a dataset of over 1 billion taxi trips from the New York City Taxi & Limousine Commission (TLC). Further details on this dataset are available here. This assignment aims to familiarize you with various tools that will be valuable for future projects, research, or career opportunities. By including AWS, Azure and GCP, we want to provide the opportunity to explore and compare these rapidly evolving platforms. This experience will help you make informed decisions when selecting a cloud platform in the future, allowing you to get started quickly and confidently. Many of the computational tasks in this assignment are straightforward, though quite a bit of “setup” will be needed before reaching the actual “programming” stage. Setting up work environments, launching clusters, monitoring compute usage, and running large-scale experiments on cloud platforms are important skills. This assignment familiarizes you with using machine clusters and understanding the pay-per-use model of most cloud services, offering a valuable first experience with cloud computing for many students. The maximum possible score for this homework is 100 points Homework Overview…………………………………………………………………………………………………………………. 1 Important Notes ……………………………………………………………………………………………………………………….. 2 Submission Notes…………………………………………………………………………………………………………………….. 2 Do I need to use the specific version of the software listed?……………………………………………………………. 2 Q1 [15 points] Analyzing trips data with PySpark…………………………………………………………………………… 3 Tasks and point breakdown…………………………………………………………………………………………………. 3 Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks ………………………………………………………… 6 Tasks and point breakdown…………………………………………………………………………………………………. 7 Q3 [35 points] Analyzing Large Amount of Data with PySpark on AWS…………………………………………….. 9 Tasks and point breakdown………………………………………………………………………………………………..10 Q4 [10 points] Analyzing a Large Dataset using Spark on GCP………………………………………………………12 Tasks and point breakdown………………………………………………………………………………………………..13 Q5 [10 points] Regression: Automobile price prediction using Azure Machine Learning ……………………..14 Tasks and point breakdown………………………………………………………………………………………………..14 2 Version 1Important Notes A. Submit your work by the due date on the course schedule. a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking. b. Before the grace period expires, you may resubmit as many times as you need. c. TA assistance is not guaranteed during the grace period. d. Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. 
You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 1 Q1 [15 points] Analyzing trips data with PySpark Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment. that you will use for this assignment. Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and instructor team) ran into many issues that could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. 
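Once the container is up, a quick way to confirm that Spark itself is working is to load the trip file and inspect its schema. This is a hedged, throwaway check: run it in a scratch notebook or console rather than in q1.ipynb (the graded notebook must not gain extra cells or imports, per the notes below), and note that the skeleton already provides the Spark entry point and file path you should actually use.

```python
# Throwaway environment check -- not part of the graded q1.ipynb.
# The app name and CSV path here are assumptions; use whatever the skeleton provides.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hw3-q1-smoke-test").getOrCreate()

df = spark.read.csv("yellow_tripdata_2019-01_short.csv", header=True)
df.printSchema()               # columns load as strings until clean_data() casts them
print(df.count(), "rows read")

spark.stop()
```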
There is no need to install any additional Java or PySpark dependencies as they are all bundled as part of the Docker container.You will use the yellow_tripdata_2019-01_short.csv dataset, a modified record of the NYC Green Taxi trips that includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values, unless specifically instructed to. Technology PySpark, Docker Deliverables [Gradescope] q1.ipynb: your solution as a Jupyter Notebook file IMPORTANT NOTES: • Only regular PySpark Dataframe Operations can be used. • Do NOT use PySpark SQL functions, i.e., sqlContext.sql(‘select * … ‘). We noticed that students frequently encountered difficult-to-resolve issues when using these functions. Additionally, since you already worked extensively with SQL in HW1, completing this task in SQL would offer limited educational value. • Do not reference sqlContext within the functions you are defining for the assignment. • If you re-run cells, remember to restart the kernel to clear the Spark context, otherwise an existing Spark context may cause errors. • Be sure to save your work often! If you do not see your notebook in Jupyter, then double check that the file is present in the folder and that your Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter then you can still move forward by clicking the “upload” button in the Jupyter notebook and uploading the file – however, if you use this approach, then your file will not be saved to disk when you save in Jupyter, so you would need to download your work by going to File > Download as… > Notebook (.ipynb), so be sure to download often to save your work! • Do not add any cells or additional library imports to the notebook. • Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed. Tasks and point breakdown 1. [1 pt] You will be modifying the function clean_data to clean the data. Cast the following columns into the specified data types: a. passenger_count — integer b. total_amount — float c. tip_amount — float d. trip_distance — float e. fare_amount — float f. tpep_pickup_datetime — timestamp 4 Version 1 g. tpep_dropoff_datetime — timestamp 2. [4 pts] You will be modifying the function common_pair. Return the top 10 pickup-dropoff location pairs that have the highest sum of passenger_count who have traveled between them. Sort the location pairs by total passengers between pairs. For each location pair, also compute the average amount per passenger over all trips (name this per_person_rate), utilizing total_amount. For pairs with the same total passengers, sort them in descending order of per_person_rate. Filter out any trips that have the same pick-up and drop-off location. Rename the column for total passengers to total_passenger_count. Sample Output Format — The values below are for demonstration purposes: PULocationID DOLocationID total_passenger_count per_person_rate 1 2 23 5.242345 3 4 5 6.61345634 3. [4 pts] You will be modifying the function distance_with_most_tip . Filter the data for trips having fares (fare_amount) greater than $2.00 and a trip distance (trip_distance) greater than 0. Calculate the tip percent (tip_amount * 100 / fare_amount) for each trip. 
Round all trip distances up to the closest mile and find the average tip_percent for each trip_distance. Sort the result in descending order of tip_percent to obtain the top 15 trip distances which tip the most generously. Rename the column for rounded trip distances to trip_distance, and the column for average tip percents tip_percent. Sample Output Format — The values below are for demonstration purposes: trip_distance tip_percent 2 6.2632344561 1 4.42342882 4. [6 pts] You will be modifying the function time_with_most_traffic to determine which hour of the day has the most traffic. Calculate the traffic for a particular hour using the average speed of all taxi trips which began during that hour. Calculate the average speed as the average trip_distance divided by the average trip duration, as distance per hour. Make sure to determine the average durations and average trip distances before calculating the speed. It will likely be helpful to cast the dates to the long data type when determining the interval. A day with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Additionally, you must separate the hours into AM and PM, with hours 0:00-11:59 being AM, and hours 12:00-23:59 being PM. Convert these times to the 12 hour time, so you can match the output below. For example, the row with 1 as time of day, should show the average speed between 1 am and 2 am in the am_avg_speed column, and between 1 pm and 2pm in the pm_avg_speed column. Use date_format along with the appropriate pattern letters to format the time of day so that it matches the example output below. Your final table should contain values sorted from 0-11 for time_of_day. There may be data missing for a time of day, and it may be null for am_avg_speed 5 Version 1 or pm_avg_speed. If an hour has no data for am or pm, there may be missing rows. You will not have rows for all possible times of day, and do not need to add them to the data if they are missing. Sample Output Format — The values below are for demonstration purposes: time_of_day am_avg_speed pm_avg_speed 1 0.953452345 9.23345272 2 5.2424622 null 4 null 2.55421905 6 Version 1 Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks Firstly, go over this Spark on Databricks Tutorial, to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv1 using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table using taxi_zone_lookup.csv that maps the location ID into the actual name of the region in NYC. The nyc-trip data dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driverreported passenger counts. Technology Spark/Scala, Databricks Deliverables [Gradescope] • q2.dbc: Your solution as Scala Notebook archive file (.dbc) exported from Databricks (see Databricks Setup Guide below) • q2.scala: Your solution as a Scala source file exported from Databricks (see Databricks Setup Guide below) • q2_results.csv: The output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() function into a file titled q2_results.csv under the relevant sections. 
Please double-check and compare your actual output with the results you copied.IMPORTANT NOTES: • Use only Firefox, Safari or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers. • Carefully follow the instructions in the Databricks Setup Guide. (You should have already downloaded the data needed for this question using the link provided before Homework Overview.) o You must choose the Databricks Runtime (DBR) version as “10.4 (includes Apache Spark 3.2.1, Scala 2.12)”. We will grade your work using this version. o Note that you do not need to install Scala or Spark on your local machine. They are provided with the DBR environment. • You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for Spark DataSet of rows. You can use the DataSet API in Spark to work on these DataFrames. Here is a Spark document that will help you get started on working with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame. o After selecting the default language as SCALA, do not use the language magic % with other languages like %r, %python, %sql etc. The language magics are used to override the default language, which you must not do for this assignment. o You must not use full SQL queries in lieu of the Spark DataFrame API. That is, you must not use functions like sql(), which allows you to directly write full SQL queries like spark.sql (“SELECT* FROM col1 WHERE …”). This should be df.select(“*”) instead. • The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads a data file nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (Refer to the Databricks Setup Guide above). It also contains code that filters the data to only keep the rows where the pickup location is different from the drop location, and the trip distance is strictly greater than 2.0 (>2.0). 1 Graph derived from the NYC Taxi and Limousine Commission 7 Version 1 o All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers. o Carefully read the instructions in the notebook, which provides hints for solving the problems. • Some tasks in this question have specified data types for the results that are of lower precision (e.g., float). For these tasks, we will accept relevant higher precision formats (e.g., double). Similarly, we will accept results stored in data types that offer “greater range” (e.g., long, bigint) than what we have specified (e.g., int). • Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed. • Hint: You may find some of the following DataFrame operations helpful: toDF, join, select, groupBy, orderBy, filter, agg, window(), partitionBy, orderBy, etc. Tasks and point breakdown 1. List the top 5 most popular locations for: a. [2 pts] dropoff based on “DOLocationID”, sorted in descending order by popularity. If there is a tie, then one with a lower “DOLocationID” gets listed first. b. [2 pts] pickup based on “PULocationID”, sorted in descending order by popularity. If there is a tie, then the one with a lower “PULocationID” gets listed first. 2. [4 pts] List the top 3 locationID’s with the maximum overall activity. 
Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first. Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.
3. [4 pts] List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island and Bronx, along with "Unknown" and "EWR") and their total number of activities, in descending order of total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens). An example output format is shown below.
4. [5 pts] List the top 2 days of the week with the largest daily average number of pick-ups, along with the average number of pick-ups on each of the 2 days, in descending order (no rounding off required). Here, the average pickup is calculated by taking an average of the number of pick-ups on different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021 and 02/15/2021 are all Mondays, so the average pick-ups for these is the sum of the pickups on each date divided by 3. An example output is shown below. Note: The day of week is a string of the day's full spelling, e.g., "Monday" instead of the number 1 or "Mon". Also, the pickup_datetime is in the format yyyy-mm-dd.
5. [6 pts] For each hour of a day (0 to 23, 0 being midnight), in order from 0 to 23 (inclusive), find the zone in the Brooklyn borough with the largest number of total pick-ups. Note: All dates for each hour should be included.
6. [7 pts] Find the 3 different days in the month of January, in Manhattan, that saw the largest positive percentage increase in pick-ups compared to the previous day, in order from largest percentage increase to smallest percentage increase. An example output is shown below. Note: All years need to be aggregated to calculate the pickups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.
List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These preformatted sections also show you the required output format from your Scala code with the necessary columns — while the column names can be different, the resulting values must be correct.
• You must manually enter the output generated into the corresponding sections of the q2_results.csv file, preferably using spreadsheet software like MS-Excel (but make sure to keep the csv format). For generating the output in the Scala notebook, refer to the show() and display() functions of Scala.
• Note that you can edit this csv file using a text editor, but please be mindful about putting the results under the designated columns.
• If you encounter a "UnicodeDecodeError", please save the file as ".csv UTF-8" to resolve it.
Note: Do NOT modify anything other than filling in the required output values in this csv file. We grade by running the Spark Scala code you write and by looking at your results listed in this file. So, make sure your output is obtained from the Spark Scala code you write. Failure to include the dbc and scala files will result in a deduction from your overall score.

Q3 [35 points] Analyzing a Large Amount of Data with PySpark on AWS
You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis.
You will be completing a task that may be accomplished using a commodity computer (e.g., consumer-grade laptops or desktops). However, we would like you to use this exercise as an opportunity to learn distributed computing on AWS, and to gain experience that will help you tackle more complex problems. The services you will primarily be using are Amazon S3 storage and Amazon Athena. You will be creating an S3 bucket, running code using Athena and its serverless PySpark engine, and then storing the output into that S3 bucket. Amazon Athena is serverless, meaning that you pay only for what you use; there are no servers to maintain that accrue costs whether or not they are being used. For this question, you will only use up a very small fraction of your AWS credit. If you have any issues with the AWS Academy account, please post in the dedicated AWS Setup Ed Discussion thread.
In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). You will be accessing the dataset directly through AWS via the code outlined in the homework skeleton. Specifically, you will be working with two samples of this dataset, one small, and one much larger. Optionally, if you would like to learn more about the dataset, check out here and here; also optionally, you may explore the structure of the data by referring to [1] [2].
You are provided with a Python notebook (q3.ipynb) file which you will complete and load into EMR. You are provided with the load_data() function, which loads two PySpark DataFrames. The first DataFrame, trips, contains trip data where each record refers to one (1) trip. The second DataFrame, lookup, maps a LocationID to its zone and borough information. It can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame.
Technology: PySpark, AWS
Deliverables [Gradescope]:
• q3.ipynb: PySpark notebook for this question (for the larger dataset).
• q3_output_large.csv: output file (comma-separated) for the larger dataset.
IMPORTANT NOTES
• Use Firefox, Safari or Chrome when configuring anything related to AWS.
• EXTREMELY IMPORTANT: Both datasets are in the US East (N. Virginia) region. Using machines in other regions for computation will incur data transfer charges. Hence, set your region to US East (N. Virginia) in the beginning (not Oregon, which is the default). This is extremely important, otherwise your code may not work, and you may be charged extra.
• Strictly follow the guidelines below, or your answer may not be graded.
a. Ensure that the parameters for each function remain as defined and that the output order and names of the fields in the PySpark DataFrames are maintained.
b. Do not import any functions which were not already imported within the skeleton.
c. You must NOT round any numeric values. Rounding numbers can introduce inaccuracies. Our grader will be checking the first 8 decimal places of each value in the DataFrame.
d. You will not have access to the Spark object directly in the autograder. If you use it in your functions, the autograder will fail! You can use the Spark Context from the DataFrame.
e. Double check that you are submitting the correct files and that the filenames follow the correct naming standard — we only want the script and output from the larger dataset. Also, double check that you are writing the right dataset's output to the right file.
f. You are welcome to store your script's output in any bucket you choose if you can download and submit the correct files.
g. Do not make any manual changes to the output files.
h. Please ensure that you do not remove #export from the HW skeleton.
i. Do not import any additional packages, INCLUDING pyspark.sql.functions, as this may cause the autograder to work incorrectly. Everything you need should be imported for you.
j. Using .rdd() can cause issues in the Gradescope environment. You can accomplish this assignment without it. In general, since the RDD API is outdated (though not deprecated), you should be wary of using this API.
k. Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
l. Regular PySpark DataFrame operations and PySpark SQL operations can be used. To use PySpark SQL operations, you must use the SQL Context on the Spark DataFrame. Example:
• df.createOrReplaceTempView("some_table")
• df.sql_ctx.sql("SELECT * FROM some_table")
Hints:
a. Refer to DataFrame commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed and withColumn. Documentation for the DataFrame APIs is located here.
b. Testing on a single, small dataset (i.e., a "test case") is helpful, but is not sufficient for discovering all potential issues, especially if such issues only become apparent when the code is run on larger datasets. It is important for you to develop more ways to review and verify your code logic.
c. Overwriting the DataFrames from the function parameters can cause unintended side effects when it comes to rounding. Be sure to preserve the DataFrames in each function.
d. Precision in data analytics is very important. Keep in mind that precision reduction in an earlier step can accumulate and be magnified, significantly affecting the final output's precision (e.g., for a dataset with 1,000,000 data points, a 0.0001 difference for each data point can lead to a total difference of 100 over the whole dataset). This is called precision loss. Check out this post for hints on how to avoid precision loss.
e. Check if you are reducing the precision (or "scale") too aggressively. Can you relax the restriction during intermediate steps?
f. Make sure you return a DataFrame. If you get NoneType errors, you are most likely not returning what you think you are.
g. Some columns may need to be cast to the right data type. Keep that in mind!
Tasks and point breakdown
Your objective is to locate profitable pick-up locations in Manhattan by analyzing taxi trip data (only trips 2 miles or longer). Follow the steps below to identify top pick-up locations based on a "weighted profit" calculation:
1. [0 pts] Setting up the AWS environment.
a. Go through all the steps in the AWS Setup Guide (you should have already completed Step 1 to create your account) to set up your AWS environment, e.g., creating an S3 storage bucket and uploading the skeleton file.
2. [1 pt] user()
a. Returns your GT username as a string (e.g., gburdell3)
3. [2 pts] long_trips(trips)
a. This function filters trips to keep only trips 2 miles or longer (i.e., >= 2).
b. Returns a PySpark DataFrame with the same schema as trips.
c. Note: Parts 4, 5 and 6 will use the result of this function.
4. [6 pts] manhattan_trips(trips, lookup)
a. This function determines the top 20 locations with a DOLocationID in Manhattan by sum of passenger count.
b. Returns a PySpark DataFrame (mtrips) with the schema (DOLocationID, pcount).
c. Note: If you encounter the error "Can only compare identically labeled DataFrame objects", it is likely due to the use of the RDD API. We recommend avoiding the RDD API since it is not compatible with the autograder. Instead, we suggest rewriting the logic using a join clause.
5. [6 pts] weighted_profit(trips, mtrips)
a. This function determines i. the average total_amount, ii. the total count of trips, and iii. the total count of trips ending in the top 20 destinations.
b. Using the above values, i. determine the proportion of trips that end in one of the popular drop-off locations (# trips that end in a popular drop-off location divided by total # of trips), and ii. multiply that proportion by the average total_amount to get a weighted_profit value based on the probability of passengers going to one of the popular destinations. iii. Return the weighted_profit.
c. Returns a PySpark DataFrame with the schema (PULocationID, weighted_profit).
6. [5 pts] final_output(wp, lookup)
a. This function i. takes the results of weighted_profit, ii. links it to the borough and zone through the lookup data frame, and iii. returns the top 20 locations with the highest weighted_profit.
b. Returns a PySpark DataFrame with the schema (Zone, Borough, weighted_profit).
c. Note: If you encounter issues with "3.5 Test Final Output", primarily due to the DataFrame returned from final_output() containing incorrect data, it is essential to reformat column data types, particularly when applying agg() operations in previous sections.
Once you have implemented all these functions, run the main() function, which is already implemented, and update the line of code to include the name of your output S3 bucket and a location. This function will fail if the output directory already exists, so make sure to change it each time you run the function. Example: final.write.csv('s3://cse6242-gburdell3/output-large3')
Your output file will appear in a folder in your S3 bucket as a csv file with a name similar to part-0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to q3_output_large.csv for submission. Do NOT make any other changes to the file.

Q4 [10 points] Analyzing a Large Dataset using Spark on GCP
The goal of this question is to familiarize you with creating storage buckets/clusters and running Spark programs on Google Cloud Platform. This question asks you to create a new Google Storage Bucket and load the NYC Taxi & Limousine Commission Dataset. You are also provided with a Jupyter Notebook q4.ipynb file, which you will load and complete in a Google Dataproc Cluster. Inside the notebook, you are provided with the skeleton for the load_data() function, which you will complete to load a PySpark DataFrame from the Google Storage Bucket you created as part of this question. Using this PySpark DataFrame, you will complete the following tasks using Spark DataFrame functions.
You will use the data file yellow_tripdata09-08-2021.csv. The preceding link allows you to download the dataset you are required to work with for this question from the course DropBox. Each line represents a single taxi trip consisting of the comma-separated columns bulleted below. All columns are of string data type. You must convert the columns marked "(decimal data type)" below into the decimal data type (do NOT use the float datatype) inside their respective functions when completing this question. Do not convert any datatypes within the load_data function.
While casting to a decimal datatype, use a precision of 38 and a scale of 10.
• vendorid
• tpep_pickup_datetime
• tpep_dropoff_datetime
• passenger_count
• trip_distance (decimal data type)
• ratecodeid
• store_and_fwd_flag
• pulocationid
• dolocationid
• payment_type
• fare_amount (decimal data type)
• extra
• mta_tax
• tip_amount (decimal data type)
• tolls_amount (decimal data type)
• improvement_surcharge
• total_amount
Technology: Spark, Google Cloud Platform (GCP)
Deliverables [Gradescope]:
q4.ipynb: the PySpark notebook for this question.
IMPORTANT NOTES:
• Use Firefox, Safari or Chrome when configuring anything related to GCP.
• Strictly follow the guidelines below, or your answer may not be graded.
o Regular PySpark DataFrame operations can be used.
o Do NOT use any functions from the RDD API or your code will break the autograder. In general, the RDD API is considered outdated, so you should use the DataFrame API for better performance and compatibility.
o Make sure to download the notebook from your GCP cluster before deleting the GCP cluster (otherwise, you will lose your work).
o Do not add new cells to the notebook, as this may break the autograder.
o Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
o Do not use any .rdd function in your code. Not only will this break the autograder, but you should be wary of using this function in general.
o Ensure that you are only submitting a COMPLETE solution to Gradescope. Anything less will break the autograder. Write local unit tests to help test your code.
Tasks and point breakdown
1. [0 pts] Set up your GCP environment.
a. Instructions to set up GCP Credits, GCP Storage and a Dataproc Cluster are provided here: written instructions.
b. Helpful tips/FAQs for special scenarios:
i. If GCP service is disabled for your Google account, try the steps in this Google support link.
ii. If you have any issues with the GCP free credits, please post in the dedicated GCP Setup Ed Discussion thread.
2. [0 pts — required] Function load_data() to load data from a Google Storage Bucket into a Spark DataFrame.
a. You must first perform this task (part 2) BEFORE performing parts 3, 4, 5, 6 and 7. No points are allocated to task 2, but it is essential that you correctly implement the load_data() function, as the remaining graded tasks depend upon this task and its correct implementation. Upload code to Gradescope ONLY after completing all tasks and removing/commenting out all the testing code. Anything else will break the autograder.
3. [2 pts] Function exclude_no_pickup_locations() to exclude trips with no pick-up locations (pick-up location id column is null or is zero) in the original data loaded in task 2 (a hedged sketch of tasks 3 to 7 follows this list).
4. [2 pts] Function exclude_no_trip_distance() to exclude trips with no distance (i.e., trip distance column is null or zero) in the dataframe output by exclude_no_pickup_locations().
5. [2 pts] Function include_fare_range() to include trips with fare from $20 (inclusive) to $60 (inclusive) in the dataframe output by exclude_no_trip_distance().
6. [2 pts] Function get_highest_tip() to identify the highest tip (rounded to 2 decimal places) in the dataframe output by include_fare_range().
7. [2 pts] Function get_total_toll() to calculate the total toll amount (rounded to 2 decimal places) in the dataframe output by include_fare_range().
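To make the expected flow of tasks 3 to 7 concrete, here is a minimal, unofficial PySpark sketch. It assumes the column names listed above, that these imports are available alongside the q4.ipynb skeleton, and that get_highest_tip()/get_total_toll() may return a scalar; the skeleton governs the exact signatures, imports and return types, so treat this only as a starting point rather than the required implementation.

# Hedged sketch only -- adapt to the signatures and imports already present in q4.ipynb.
from pyspark.sql.functions import col, round as sround, max as smax, sum as ssum
from pyspark.sql.types import DecimalType

DEC = DecimalType(38, 10)  # precision 38, scale 10, as required above

def exclude_no_pickup_locations(df):
    # Keep rows whose pick-up location id is neither null nor zero.
    # Casting the string column to int here is an assumption, not a stated requirement.
    pu = col("pulocationid").cast("int")
    return df.filter(pu.isNotNull() & (pu != 0))

def exclude_no_trip_distance(df):
    # Keep rows whose trip distance is neither null nor zero.
    d = col("trip_distance").cast(DEC)
    return df.filter(d.isNotNull() & (d != 0))

def include_fare_range(df):
    # Keep trips with fare between $20 and $60, both inclusive.
    f = col("fare_amount").cast(DEC)
    return df.filter((f >= 20) & (f <= 60))

def get_highest_tip(df):
    # Highest tip_amount, rounded to 2 decimal places (returned here as a scalar;
    # check whether the skeleton expects a scalar or a DataFrame).
    return df.select(sround(smax(col("tip_amount").cast(DEC)), 2).alias("highest_tip")).first()["highest_tip"]

def get_total_toll(df):
    # Total tolls_amount, rounded to 2 decimal places (same return-type caveat as above).
    return df.select(sround(ssum(col("tolls_amount").cast(DEC)), 2).alias("total_toll")).first()["total_toll"]

Each function consumes the DataFrame produced by the previous step, mirroring the chain described in the task list.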
Q5 [10 points] Regression: Automobile price prediction using Azure Machine Learning
The primary purpose of this question is to introduce you to Microsoft Machine Learning Studio by familiarizing you with its basic functionalities and machine learning workflows. Go through the Automobile Price Prediction tutorial and create/run ML experiments to complete the following tasks. You will not incur any cost if you save your experiments on Azure until submission. Once you are sure about the results and have reported them, feel free to delete your experiments. You will manually modify the given file q5.csv using a plain text editor, adding the results from the following tasks.
Technology: Azure Machine Learning
Deliverables [Gradescope]:
q5.csv: a csv file containing results for all parts
IMPORTANT NOTES:
• Strictly follow the guidelines below, or your answer may not be graded.
o DO NOT change the order of the questions.
o Report the exact numerical values that you get in your output. DO NOT round any of them.
o When manually entering a value into the csv file, append it immediately after a comma, so there will be NO space between the comma and your value, and no trailing spaces or commas after your value.
o Follow the tutorial and do not change the values for L2 regularization. For tasks 3 and 4, select the columns given in the tutorial.
Tasks and point breakdown
1. [0 pts] Create and use a free workspace instance on Azure Machine Learning. Use your Georgia Tech username (e.g., jdoe3) to log in.
2. [0 pts] Update q5.csv by replacing gburdell3 with your GT username.
3. [3 pts] Repeat the experiment described in the tutorial and report the values of all metrics as mentioned in the Evaluate Model section of the tutorial. Make sure the Split Data looks as it does below:
4. [3 pts] Repeat the experiment mentioned in task 3 with a different value of Fraction of rows in the first output dataset in the Split Data module. Change the value to 0.8 from the originally set value of 0.7. Report the corresponding values of the metrics.
5. [4 pts] After fully completing tasks 3 and 4, run a new experiment — evaluate the model using 5-fold cross-validation (CV).
a. Select parameters in the Partition and Sample component in accordance with the figure below.
b. For Cross Validate Model, set the column name as "price" for CV and use 0 as the random seed.
c. Report the values of Root Mean Squared Error (RMSE) and Coefficient of Determination for each of the five folds (the 1st fold corresponds to fold number 0, and so on). Do NOT round the results. Report exact values.
d. HINT: to see the results, right-click Cross Validate Model and select Preview data, then Evaluation results by fold. Make sure to utilize the same data cleaning/processing steps as you did before.
Figure: Property Tab of Partition and Sample Module
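For reference only when interpreting the numbers you report (Azure ML Studio computes both metrics for you; nothing here needs to be implemented), the standard definitions of the two reported metrics are:

RMSE = sqrt( (1/n) * Σ_i (y_i − ŷ_i)^2 )
Coefficient of Determination (R^2) = 1 − Σ_i (y_i − ŷ_i)^2 / Σ_i (y_i − ȳ)^2

where y_i is the true price of the i-th test row in a fold, ŷ_i is the model's predicted price, ȳ is the mean of the true prices in that fold, and n is the number of test rows. A lower RMSE and an R^2 closer to 1 indicate a better fit.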
CSE 6242 / CX 4242: Data and Visual Analytics
HW 2: Tableau, D3 Graphs, and Visualization
"Visualization gives you answers to questions you didn't know you had" – Ben Shneiderman
Download the HW2 Skeleton before you begin.
Homework Overview
Data visualization is an integral part of exploratory analysis and communicating key insights. This homework focuses on exploring and creating data visualizations using two of the most popular tools in the field: Tableau and D3.js. All 5 questions use data on the same topic to highlight the uses and strengths of different types of visualizations. The data comes from BoardGameGeek and includes games' ratings, popularity, and metadata. Below are some terms you will often see in the questions:
• Rating – a value from 0 to 10 given to each game. BoardGameGeek calculates a game's overall rating in different ways, including Average and Bayes, so make sure you are using the correct rating called for in a question. A higher rating is better than a lower rating.
• Rank – the overall rank of a boardgame from 1 to n, with ranks closer to 1 being better and n being the total number of games. The rank may be for all games or for a subgroup of games such as abstract games or family games.
The maximum possible score for this homework is 100 points. Students have the option to complete any 90 points' worth of work to receive 100% (equivalent to 15 course total grade points) for this assignment. They can earn more than 100% if they submit additional work. For example, a student scoring 100 points will receive 111% for the assignment (equivalent to 16.67 course total grade points, as shown on Canvas).
Contents
• Download the HW2 Skeleton before you begin
• Homework Overview
• Important Notes
• Submission Notes
• Do I need to use the specific version of the software listed?
• Q1 [25 points] Designing a good table. Visualizing data with Tableau.
• Important Points about Developing with D3 in Questions 2–5
• Q2 [15 points] Force-directed graph layout
• Q3 [15 points] Line Charts
• Q4 [20 points] Interactive Visualization
• Q5 [25 points] Choropleth Map of Board Game Ratings
Important Notes
A. Submit your work by the due date on the course schedule.
a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking.
b. Before the grace period expires, you may resubmit as many times as needed.
c. TA assistance is not guaranteed during the grace period.
d. Submissions during the grace period will display as "late" but will not incur a penalty.
e. We will not accept any submissions after the grace period ends.
B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion.
C. You may discuss ideas with other students at the "whiteboard" level (e.g., how cross-validation works, use a HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student's own answers.
D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute's Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for the assignments concerned, and prohibition from withdrawing from the class.
Submission Notes
A. All questions are graded on the Gradescope platform, accessible through Canvas.
a. Question 1 will be manually graded after the final HW due date and Grace Period.
b. Questions 2-5 are auto-graded at the time of submission.
B. We will not accept submissions anywhere outside of Gradescope.
C. Submit all required files as specified in each question. Make sure they are named correctly.
D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive.
E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and its error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a "final" check.
F. Gradescope cannot run code that contains syntax errors. If you get the "The autograder failed to execute correctly" error, verify:
a. The code is free of syntax errors (by running it locally)
b. All methods have been implemented
c. The correct file was submitted with the correct name
d. No extra packages or files were imported
G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time.
H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST "activate" it (click the "Submission History" button at the bottom toolbar, then "Activate").
Do I need to use the specific version of the software listed?
Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder.
Q1 [25 points] Designing a good table. Visualizing data with Tableau.
Goal: Design a table, a grouped bar chart, and a stacked bar chart with filters in Tableau.
Technology: Tableau Desktop
Deliverables (Gradescope): After selecting HW2 – Q1, click Submit Images. You will be taken to a list of questions for your assignment. Click Select Images and submit the following four PNG images under the corresponding questions:
● table.png: Image/screenshot of the table in Q1.1
● grouped_barchart.png: Image of the chart in Q1.2
● stacked_barchart_1.png: Image of the chart in Q1.3 after filtering data for Max.Players = 2
● stacked_barchart_2.png: Image of the chart in Q1.3 after filtering data for Max.Players = 4
Q1 will be manually graded after the grace period.
Setting Up Tableau
Install and activate Tableau Desktop by following "HW2 Instructions" on Canvas.
The product activation key is for your use in this course only. Do not share the key with anyone. If you already have Tableau Desktop installed on your machine, you may use this key to reactivate it.
If you do not have access to a Mac or Windows machine, use the 14-day trial version of Tableau Online:
1. Visit https://www.tableau.com/trial/tableau-online
2. Enter your information (name, email, GT details, etc.)
3. You will then receive an email to access your Tableau Online site
4. Go to your site and create a workbook
If neither of the above methods works, use Tableau for Students. Follow the link and select "Get Tableau For Free". You should be able to receive an activation key which offers you one year of Tableau Desktop use at no cost by providing a valid Georgia Tech email.
Connecting to Data
1. It is optional to use Tableau for Q1.1. Otherwise, complete all parts using a single Tableau workbook.
2. Q1 will require connecting Tableau to two different data sources. You can connect to multiple data sources within one workbook by following the directions here.
3. For Q1.1 and Q1.2:
a. Open Tableau and connect to a data source. Choose To a File – Text file. Select the popular_board_game.csv file from the skeleton.
b. Click on the graph area at the bottom section next to "Data Source" to create worksheets.
4. For Q1.3:
a. You will need a data.world account to access the data for Q1.3. Add a new data source by clicking on Data – New Data Source.
b. When connecting to a data source, choose To a Server – Web Data Connector.
c. Enter this URL to connect to the data.world data set on board games. You may be prompted to log in to data.world and authorize Tableau. If you haven't used data.world before, you will be required to create an account by clicking "Join Now". Do not edit the provided SQL query.
NOTE: If you cannot connect to data.world, you can use the provided csv files for Q1 in the skeleton. The provided csv files are identical to those hosted online and can be loaded directly into Tableau.
d. Click the graph area at the bottom section to create another worksheet, and Tableau will automatically create a data extract.
Table and Chart Design
1. [5 points] Good table design. Visualize the data contained in popular_board_game.csv as a data table (known as a text table in Tableau). In this part (Q1.1), you can use any tool (e.g., Excel, HTML, Pandas, Tableau) to create the table. We are interested in grouping popular games into "support solo" (min player = 1) and "not support solo" (min player > 1). Your table should clearly communicate information about these two groups simultaneously. For each group (Solo Supported, Solo Not Supported), show:
a. Total number of games in each category (fighting, economic, …)
b. In each category, the game with the highest number of ratings. If more than one game has the same (highest) number of ratings, pick the game you prefer. NOTE: Level of Detail expressions may be useful if you use Tableau.
c. Average rating of games in each category (use a simple average), rounded to 2 decimal places.
d. Average playtime of games in each category, rounded to 2 decimal places.
e. In the bottom left corner below your table, include your GT username. (In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. If you use Tableau, refer to the tutorial here.)
f. Save the table as table.png. (If you use Tableau, go to Worksheet/Dashboard Export Image.)
NOTE: Do not take screenshots in Tableau since your image must have high resolution. You can take a screenshot if you use HTML, Pandas, etc.
Your learning goal here is to practice good table design, which is not strongly dependent on the tool that you use. Thus, we do not require that you use Tableau in this part. You may decide the most meaningful column names, the number of columns, and the column order. You are not limited to only the techniques described in the lecture. For OMS students, the lecture video on this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus students, review lecture slides 42 and 43.
2. [10 points] Grouped bar chart. Visualize popular_board_game.csv as a grouped bar chart in Tableau. Your chart should display game category (e.g., fighting, economic, …) along the horizontal axis and game count along the vertical axis. Show game playtime (e.g.,
(A) The first part of your assignment is to code up the algorithms in Chapter 21 of the 3rd edition of the textbook (Chapter 19 of the 4th), which is on Data Structures for Disjoint Sets.
(1) Code up the Make-Set, Find-Set and Union operations described in Section 21.2, using the weighted-union heuristic. Set up experiments to get empirical evidence for the statement of Theorem 21.1.
(2) Code up the Make-Set, Find-Set and Union operations described in Section 21.3, using the union-by-rank and path-compression heuristics. Set up experiments to get empirical evidence for the claim made on page 572 that the worst-case running time is O(m α(n)). (An illustrative sketch of both heuristics is given at the end of this assignment.)
(B) The second part of your assignment is to give a detailed solution to exercise 21.3-4 of the 3rd edition of the textbook.
You must submit the following:
(i) Your source code for the algorithms (using C, C++ or Java).
(ii) A document describing the experiments that you ran for part (A), and graphs illustrating the results.
(iii) A document with the solution to the exercise in part (B).
You must submit your files to the AA Moodle page by Tuesday, 22 October, at 23h00.
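For orientation only, here is a short Python sketch of the two heuristics named in part (A). It is an illustration under stated assumptions, not the required deliverable: your submission must be written in C, C++ or Java, and you must add your own instrumentation (operation counts or timings over increasing n and m) to produce the empirical evidence asked for above.

# Illustrative Python sketch (the assignment itself must be in C, C++ or Java).

class WeightedUnionSets:
    """List-based representation with the weighted-union heuristic (Section 21.2):
    always splice the smaller set onto the larger one."""
    def __init__(self, n):
        self.set_of = list(range(n))            # set id that each element currently belongs to
        self.members = [[i] for i in range(n)]  # members of each set

    def find_set(self, x):
        return self.set_of[x]

    def union(self, x, y):
        sx, sy = self.find_set(x), self.find_set(y)
        if sx == sy:
            return
        if len(self.members[sx]) < len(self.members[sy]):
            sx, sy = sy, sx                     # make sx the larger set
        for e in self.members[sy]:
            self.set_of[e] = sx                 # re-label every element of the smaller set
        self.members[sx].extend(self.members[sy])
        self.members[sy] = []

class ForestSets:
    """Disjoint-set forest with union by rank and path compression (Section 21.3)."""
    def __init__(self, n):
        self.parent = list(range(n))            # Make-Set for elements 0..n-1
        self.rank = [0] * n

    def find_set(self, x):
        # Path compression: point every node on the search path directly at the root.
        if self.parent[x] != x:
            self.parent[x] = self.find_set(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:       # union by rank: shallower tree goes under the deeper one
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

A typical experiment times a mixed sequence of m random Union and Find-Set calls for increasing n, and plots total time (or element relabelings and pointer updates) against m and n to compare the behavior of the two implementations.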