Exercise 4

Due: See website for due date. What to submit: See website.

The theme of this exercise is automatic memory management, leak detection, and virtual memory.

1. Understanding valgrind's leak checker

Valgrind is a tool that can aid in finding memory leaks in C programs. To that end, it performs an analysis similar to the "mark" phase of a traditional mark-and-sweep garbage collector right before a program exits, and identifies still-reachable objects and leaks. Note that at this point, the program's main function has already returned, so any local variables defined in it have already gone out of scope. For leaked (or lost) objects, it uses the definition prevalent for C programs: these are objects that have been allocated but not yet freed, and there is no possible way for a legal program to access them in the future.

Read Section 4.2.8 "Memory leak detection" in the Valgrind Manual [URL] and then construct a C program leak.c that, when run with

    valgrind --leak-check=full --show-leak-kinds=all ./leak

produces the following output:

    ==44047== Memcheck, a memory error detector
    ==44047== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
    ==44047== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
    ==44047== Command: ./leak
    ==44047== Parent PID: 43753
    ==44047==
    ==44047==
    ==44047== HEAP SUMMARY:
    ==44047==     in use at exit: 48 bytes in 6 blocks
    ==44047==   total heap usage: 6 allocs, 0 frees, 48 bytes allocated
    ==44047==
    ==44047== 8 bytes in 1 blocks are still reachable in loss record 1 of 6
    ==44047==    at 0x484482F: malloc (vg_replace_malloc.c:446)
    ==44047==    by 0x401138: main (leak.c:11)
    ==44047==
    ==44047== 8 bytes in 1 blocks are still reachable in loss record 2 of 6
    ==44047==    at 0x484482F: malloc (vg_replace_malloc.c:446)
    ==44047==    by 0x401150: main (leak.c:12)
    ==44047==
    ==44047== 8 bytes in 1 blocks are still reachable in loss record 3 of 6
    ==44047==    at 0x484482F: malloc (vg_replace_malloc.c:446)
    ==44047==    by 0x401167: main (leak.c:13)
    ==44047==
    ==44047== 8 bytes in 1 blocks are indirectly lost in loss record 4 of 6
    ==44047==    at 0x484482F: malloc (vg_replace_malloc.c:446)
    ==44047==    by 0x401182: main (leak.c:16)
    ==44047==
    ==44047== 8 bytes in 1 blocks are indirectly lost in loss record 5 of 6
    ==44047==    at 0x484482F: malloc (vg_replace_malloc.c:446)
    ==44047==    by 0x40119D: main (leak.c:17)
    ==44047==
    ==44047== 24 (8 direct, 16 indirect) bytes in 1 blocks are definitely lost in loss record 6 of 6
    ==44047==    at 0x484482F: malloc (vg_replace_malloc.c:446)
    ==44047==    by 0x401174: main (leak.c:15)
    ==44047==
    ==44047== LEAK SUMMARY:
    ==44047==    definitely lost: 8 bytes in 1 blocks
    ==44047==    indirectly lost: 16 bytes in 2 blocks
    ==44047==      possibly lost: 0 bytes in 0 blocks
    ==44047==    still reachable: 24 bytes in 3 blocks
    ==44047==         suppressed: 0 bytes in 0 blocks
    ==44047==
    ==44047== For lists of detected and suppressed errors, rerun with: -s
    ==44047== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

(The line numbers in your reconstruction need not match, but the LEAK SUMMARY should match, including the number of blocks and the number of bytes shown.)

2.
Reverse Engineering a Memory Leak

In this part of the exercise, you will be given a post-mortem dump of a JVM's heap that was obtained when running a program with a memory leak. The dump was produced at the point in time when the program ran out of memory because its live heap size exceeded the maximum, which can be accomplished as shown in this log:

    $ java -XX:+HeapDumpOnOutOfMemoryError -Xmx64m OOM
    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to java_pid2353427.hprof ...
    Heap dump file created [89551060 bytes in 0.379 secs]
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.Arrays.copyOf(Arrays.java:3720)
        at java.base/java.util.Arrays.copyOf(Arrays.java:3689)
        at java.base/java.util.PriorityQueue.grow(PriorityQueue.java:305)
        at java.base/java.util.PriorityQueue.offer(PriorityQueue.java:344)
        at OOM.main(OOM.java:19)

Your task is to examine the heap dump (oom.hprof) and reverse engineer the leaky program. To that end, you must install the Eclipse Memory Analyzer on your computer. It can be downloaded from this URL. Open the heap dump.

Requirements
• Your program must run out of memory when run as shown above. You should double-check that the heap dump it creates matches the provided dump, where "matches" is defined as follows.
• The structure of the reachability graph of the subcomponent with the largest retained size should be similar in your heap dump and in the provided heap dump. (Other information, such as the contents of arrays, may differ.)
• You will need to write one or more classes and write code that allocates these objects and creates references between them. You should choose the same field and class names in your program as in the heap dump, and no extra ones (we will check this). Think of field names as edge labels in the reachability graph.
• You should investigate which classes from Java's standard library are involved in the leak.
Hints
• The program that was used to create the heap dump is 22 lines long (without comments, and including the main function), though your line numbers may differ.
• Static inner classes are separated with a dollar sign $. For instance, A$B is the name of a static inner class called B nested in A. (Your solution should use the same class names as in the heap dump.)
• Start with the "Leak Suspects" report, then look in Details. Use the "List Objects ... with outgoing references" feature to find a visualization of the objects that were part of the heap when the program ran out of memory.
• The "dominator tree" option can also give you insight into the structure of the object graph. Zoom in on the objects that have the largest "Retained Heap" quantity.
• Use the Java Tutor website to write small test programs and trace how the reachability graph changes over time.
• Do not forget the -Xmx64m switch when running your program, or else your program may run for several minutes before running out of memory, even if implemented correctly. (If implemented incorrectly, it will run forever.)
• Do not access the oom.hprof file through a remote file system path such as a mapped Google Drive or similar. Students in the past have reported runtime errors in Eclipse MAT when trying to do that. Instead, copy it to your local computer's file system first as a binary file. The SHA256 sum of oom.hprof is 04df06c33e684cc8b0c4e278176ccca885d0abd71fb506e29ad25d8c331a1efa

3. Using mmap to list the entries in a ZIP file

Write a short program zipdir that displays the list of entries inside a ZIP file whose name is passed to the program as its first argument. For each entry it should also print its compression ratio as a percentage, rounded to the nearest tenth of a percent. The compression ratio is defined as the ratio of the compressed size to the uncompressed size. A sample use would be:

    $ ./zipdir heap.zip
    heap1.dot                  41.9%
    heap1.in                   66.0%
    heap1.out                 100.0%
    heap1.png                  89.4%
    heap2.dot                  36.8%
    heap2.in                   59.5%
    heap2.out                 100.0%
    heap2.png                  92.5%
    heap3.dot                  37.0%
    heap3.in                   59.2%
    heap3.out                 100.0%
    heap3.png                  92.0%

Your program should use only the open(2), fstat(2), and mmap(2) system calls (plus any system calls needed to output the result, such as write(2) via printf). Do not use read(2) (or higher-level functions such as fread(3), etc. that call read() internally).

The ZIP file format is described, among other places, on the Wikipedia page https://en.wikipedia.org/wiki/ZIP_(file_format)

Use the following algorithm:
• Open the file with open(2) in read-only mode.
• Use fstat(2) to determine the length of the file.
• Use mmap(2) to map the entire file into memory in a read-only way.
• Scan from the back of the file until you find the beginning marker of the End of Central Directory Record (EOCD).
• Extract the number of central directory records in this ZIP archive and the start offset of the central directory.
• Then, starting from the start offset of the central directory, examine each central directory file header and output the filename contained in it, along with the compression ratio. (Hint: use the following format string for printf: printf("%-25.*s %5.1f%%\n", namelength, name, ratio);)
• Skip forward to the next central directory record by advancing 46 + m + n + k bytes, where n is the length of the filename, m is the extra field length, and k is the file comment length contained in each central directory file header.

Simplifying assumptions/hints:
• All multibyte integers in a ZIP file are stored in little-endian order, and, for the purposes of this exercise, you may assume that the host byte order of the machine on which your program runs is little-endian as well.
• You may use pointer arithmetic on void * pointers, which uses a stride of 1 byte (i.e., it assumes that sizeof(void)==1). To access 16-bit or 32-bit values, use uint16_t * and uint32_t *, respectively, under the assumption of little-endian host byte order.
• If the given file is not a well-formed ZIP archive, then the behavior of your program can be undefined.
• Be sure to handle empty ZIP files that have no entries.
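As a language-neutral sanity check of the algorithm above, here is a rough Python sketch that follows the same open/fstat/mmap steps and walks the EOCD and central directory (your submission must be the C program zipdir; the struct offsets below come from the ZIP format description, and the error handling is deliberately minimal, assuming a well-formed archive):

```python
import mmap
import os
import struct

def zip_entries(path):
    """Return a list of (filename, compression ratio in %) for a ZIP file."""
    fd = os.open(path, os.O_RDONLY)                  # open(2)
    size = os.fstat(fd).st_size                      # fstat(2)
    mem = mmap.mmap(fd, size, access=mmap.ACCESS_READ)  # mmap(2), read-only
    # Scan from the back for the EOCD signature "PK\x05\x06" (0x06054b50).
    eocd = mem.rfind(b"PK\x05\x06")
    # EOCD: total entry count at offset +10 (u16), central directory size
    # at +12 (u32), central directory start offset at +16 (u32).
    count, _cd_size, off = struct.unpack_from("<HII", mem, eocd + 10)
    entries = []
    for _ in range(count):
        # Central directory file header: compressed size at +20 (u32),
        # uncompressed size at +24 (u32), then name/extra/comment
        # lengths n, m, k at +28 (three u16); filename starts at +46.
        csize, usize = struct.unpack_from("<II", mem, off + 20)
        n, m, k = struct.unpack_from("<HHH", mem, off + 28)
        name = mem[off + 46 : off + 46 + n].decode()
        ratio = 100.0 * csize / usize if usize else 100.0
        entries.append((name, round(ratio, 1)))
        off += 46 + n + m + k                        # next header
    mem.close()
    os.close(fd)
    return entries
```

Note that an empty archive consists of the EOCD record alone, so count is 0 and the loop body never runs, which is exactly the edge case the last hint asks you to handle.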
Assignment 3

Note 1: For answers with Python, display both the code and the results clearly.
Note 2: For answers with manual calculation, please display all calculation steps clearly.

Question 1. [30 points @ 6 points each]
A firm collected 5 training instances with 2 features X1 and X2 and their types.

    Instance   X1     X2     Type
    1          12.1   11.7   +
    2          7.9    2.1    ×
    3          7.8    8.4    +
    4          7.3    6.9    ×
    5          11.2   8.9    +

(a) Use Python to plot the 5 instances with X1 on the x-axis and X2 on the y-axis. Visualize instances with different colors according to their Type values.

With a new instance (i.e., Instance 6) with (X1, X2) = (6.5, 2.1), please complete the tasks below with either Python or manual calculation. Round all results to 4 decimal places.

(b) Calculate the Euclidean distance between the new instance and each of the 5 training instances using both X1 and X2.
(c) Calculate their cosine distance as well.
(d) What is the predicted Type value for the new instance using the 3-NN algorithm and majority vote (based on cosine distance)?
(e) What is the predicted Type value for the new instance using the 3-NN algorithm and weighted voting (based on the cosine distance computed at step (c))? What is the estimated class probability for it?

Please report the results in one or two tables. For example, answers for Q1(b)-(c) can be organized as below:

    Instance   X1     X2     Type   (b) Euclidean Distance   (c) Cosine Distance
    1          12.1   11.7   +      …                        …
    …
    6          6.5    2.1

Question 2. [30 points]
A firm collected 6 instances with 2 features X1 and X2.

    Instance   X1   X2
    1          1    4
    2          1    3
    3          0    5
    4          5    2
    5          6    3
    6          4    0

With instances 1 and 4 selected as the initial centroids, we'd like to simulate the k-means algorithm to separate all instances into two clusters (k = 2). Please complete the tasks below with either Python or manual calculation; round results to 2 decimal places.

(a) [5 points] Compute the Euclidean distance from each instance to the two centroids.
(b) [5 points] Assign instances to the two clusters by finding their closest centroids.
(c) [5 points] Compute the clustering quality with SSE = Σ_{i=1}^{k} Σ_{p∈C_i} d(p, m_i)². (Note: d(p, m_i) is the Euclidean distance between instance p and its centroid m_i.)
(d) [5 points] Compute the mean feature values for the instances in the two clusters respectively, in the format (X1, X2).
(e) [10 points] Update the two cluster centroids with the mean feature values calculated in step (d), then repeat steps (a)-(d) once. Will the clustering result (i.e., the cluster allocation) change? Is there any improvement in SSE?

Please report the results in one or two tables. For example, answers for Q2(a)-(d) can be reported in the table below.

    Instance   X1   X2   (a) Distance to Instance 1   (a) Distance to Instance 4   (b) Cluster Label   (d) Updated Centroid
    1          1    4
    2          1    3
    …
    6          4    0
    (c) SSE:

Question 3. [24 points]
A bank trained a classification model to predict the likelihood of default for each customer. There are 1000 customers in the database: the "No Default" cases make up 80% of the data, while the "Default" cases make up 20%. Applying this classifier to this dataset yields the confusion matrix below.

                               Predicted Class
                               Default   No Default
    Actual    Default          150       50
    Class     No Default       100       700

As the average lending amount is $100 and the interest rate is 10%, the cost-benefit matrix (negative numbers mean cost) is:

                               Predicted Class
                               Default   No Default
    Actual    Default          0         -$100
    Class     No Default       0         $10

(a) [4 points] Which group ("Default" or "No Default") will you consider as the positive class?
(b) [8 points @ 2 points each] Calculate the following scores for this model: (i) Accuracy (ii) True positive rate (sensitivity) (iii) True negative rate (specificity) (iv) Precision (for the positive class only)
(c) [4 points] Calculate the expected value (per person) for this model.
(d) [4 points] Assume we aim to target the same proportion of customers as in the first table, with only positive predictions targeted. Write down the confusion matrix for a random classifier.
(e) [4 points] Calculate the overall expected value (per person) for the random classifier in step (d).

Question 4. [16 points]
Two classifiers (Model A and Model B) are used to predict the probability of an increase in the Fed Funds rate (i.e., increase vs. no increase), with each quarter considered as an instance. The predicted increase probabilities over the past 6 quarters (instances) are displayed in the following table:

    Quarter   Actual   Model A   Model B
    1         1        0.43      0.63
    2         1        0.52      0.53
    3         1        0.85      0.56
    4         1        0.69      0.71
    5         0        0.03      0.18
    6         0        0.31      0.76

Please complete the points below either with Python or manually.

(a) [12 points] Plot the ROC curve for the 2 classifiers and the random classifier. Please calculate the TP and FP rates at the following cutoff values: [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1]. (Note: you may need to calculate the TP and FP rates for each cutoff manually. The visualization can be done either manually or with Python.)
(b) [4 points] Which model is better? Why?
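To illustrate the bookkeeping behind Q4(a), one ROC point per cutoff can be computed as below. This is only a sketch: it assumes the common convention that a quarter is predicted as an increase when its score is at or above the cutoff, so verify each rate by hand as the question asks.

```python
def roc_point(actual, scores, cutoff):
    """Return (TPR, FPR) when predicting positive for scores >= cutoff."""
    tp = sum(a == 1 and s >= cutoff for a, s in zip(actual, scores))
    fn = sum(a == 1 and s < cutoff for a, s in zip(actual, scores))
    fp = sum(a == 0 and s >= cutoff for a, s in zip(actual, scores))
    tn = sum(a == 0 and s < cutoff for a, s in zip(actual, scores))
    return tp / (tp + fn), fp / (fp + tn)

# Values taken directly from the table in Question 4.
actual  = [1, 1, 1, 1, 0, 0]
model_a = [0.43, 0.52, 0.85, 0.69, 0.03, 0.31]
model_b = [0.63, 0.53, 0.56, 0.71, 0.18, 0.76]

for c in [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1]:
    print(c, roc_point(actual, model_a, c), roc_point(actual, model_b, c))
```

Plotting the resulting (FPR, TPR) pairs, together with the diagonal for the random classifier, gives the ROC curves the question asks for.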
UNIT CODE: ACFIM0005
UNIT NAME: Quantitative Methods, Big Data, and Machine Learning
October 2024

Overview
• The coursework represents 40% of the final mark for the unit.
• The coursework is in the form of a report. The word limit for the written part of the assignment is 3,000 words (excluding tables, references, and appendices). Please note that this is a word limit, not a target. Output from Python, including charts or tables, can be pasted into the report.
• Your code must be attached in the Appendix, ideally with some explanatory comments.
• The coursework is group work. You need to arrange yourselves into groups of 3 or 4 people (groups smaller than 3 or larger than 4 are not permitted). All members of a given group will receive the same mark, and it is up to you to determine the allocation of work within the group and to ensure that all group members make a valid contribution. All members of a group should be happy with the whole submission, as you assume joint responsibility.
• Penalties will apply if the coursework is submitted late.

Coursework requirements

1. The following questions are based on the data used in "Empirical Asset Pricing via Machine Learning" by Shihao Gu, Bryan Kelly, and Dacheng Xiu (Review of Financial Studies, Vol. 33, Issue 5, (2020), 2223-2273), henceforth GKX. You can download the GKX data directly from Dacheng Xiu's website: https://dachxiu.chicagobooth.edu/, which contains 94 firm characteristics. Merge your data with the monthly stock return file on Blackboard using the PERMNO and date columns. Please include your code in the appendix with comments.

a. Randomly select 500 stocks from the dataset and define your own sample period, extracting their market beta and constructing the one-month-ahead return. Cross-sectionally rank those stocks into decile portfolios based on their market betas in every month. Report the average one-month-ahead return performance of each beta-sorted portfolio.
Are those returns statistically significantly different from zero? [5]
[Hint 1: You can find the variable definitions in the internet appendix of the paper, which is also available on Dacheng Xiu's website.]
[Hint 2: Make sure the randomly selected stock sample is reproducible.]

b. Please find the market index return and risk-free rate for the U.S. market. Calculate the excess return for each beta-sorted portfolio, perform an OLS regression against the market excess return, and obtain the full-sample beta coefficient. What do you observe? [5]
i. Does a low-beta portfolio report a low beta coefficient, whereas a high-beta portfolio reports a high beta coefficient? What about their t-values and R-squared?
ii. Does a low-beta portfolio on average earn lower returns, whereas a high-beta portfolio earns higher returns? Please analyse the statistics.

c. Plot all the portfolios' full-sample average excess returns against their beta coefficients in a graph. Fit a line using ordinary least squares estimates. What do you observe? [10]
i. Is the line positively sloped, negatively sloped, or flat?
ii. Is the intercept significantly positive, significantly negative, or insignificant from zero? Please use regressions to provide a statistical test.

d. Split your sample into two based on whether the market excess return is above or below the full-sample median. Repeat the 1.c exercise for each subsample and discuss your findings. [5]

e. Test the validity of the CAPM at the individual stock level. [10]

    EXRet_{i,t+1} = α + β · BETA_{i,t} + γ · x_{i,t} + ε_{i,t+1}

i. For each month t, run a Fama-MacBeth (FM) cross-sectional regression of the above equation across stocks i, then average the coefficients over time; include additional stock (firm)-level control variables x_{i,t}. [Hint: you need to justify the inclusion of the selected control variables.]
ii. Repeat the exercise, but split the sample into two based on whether the market excess return is above or below the full-sample median.
What do you observe? Discuss your findings.

2. Using the same dataset as in Q1 but now including all the stocks, your goal is to predict the one-month-ahead returns by training different ML models using the large pool of 94 firm characteristics (20 of the characteristics in the GKX data have monthly frequency, while the rest are either quarterly or annual).

a. Choose all 20 of the monthly characteristics and add 10 other quarterly/annual characteristics of your choosing to obtain a list of 30 predictive features. Report the summary statistics for the features in your list and give a brief definition for each. [5]
b. Pre-process the predictive features by applying the rank normalization technique described in the paper (see Section 2.1, footnote 29). [5]
c. Train two different ML models, Partial Least Squares (PLS) and Random Forest (RF), to predict one-month-ahead returns. Your out-of-sample testing period should start in January 1990 (i.e., the first out-of-sample prediction should be for February 1990) and end in November 2021 (the last prediction is for December 2021). Be as explicit as possible when describing your training methodology. Make sure that there is no forward-looking bias (i.e., leakage). Choose appropriate metrics to measure the model fit and report both in-sample and out-of-sample performance. Compare your results across the two models. [30]
d. Compare the out-of-sample performance in 1990-1999 to the performance in 2000-2021. Is there a difference? [5]
e. Choose an appropriate method to measure variable importance and report the variable importance results for your two models. [5]

3. Continuing with the same dataset but now selecting only 3 stocks, your goal is to identify each focal stock's potential leaders based on the following lead-lag model:

    R_{i,t+1} = α_i + Σ_{j≠i} β_{i,j} · R_{j,t} + β_{i,i} · R_{i,t} + ε_{i,t+1}

where, for each focal stock i in month t + 1, the independent candidate variables include every other stock j's monthly return in month t, as well as stock i's own monthly return in month t.
a. Construct 3 new panel datasets where the index contains month t and the columns include stock i's or j's monthly returns. Pre-process the predictive features and ensure there are no missing values. [5]
b. Choose an appropriate rolling-window length and estimate the above equation using LASSO month by month. Your out-of-sample testing period should be every following month t + 2. [10]
[Hint: A longer regression period is likely to reduce noise, but an overly long window will prevent you from uncovering relatively short-lived leader-follower stock pairs.]
i. Please describe your training methodology in detail and be careful about look-ahead bias. Choose appropriate metrics to measure the model fit and report both in-sample and out-of-sample performance.
ii. How many stocks are identified as stock leaders in the cross-section? How many of the identified stock leaders share the same two-digit SIC code (i.e., sic2) with the focal stock? How persistent (time-varying) are those identified stock leaders? Please provide tables/graphs to show your findings.

Marks will be awarded for:
Your work will be assessed in terms of how well you have carried out the various parts of the coursework, details of which are as follows:
1. Appropriate construction of the variables used in the models.
2. Correct implementation of the econometric or machine learning tools.
3. Correctness, clarity, completeness, and relevance of your interpretations and discussions for each question.
4. Presentation of the coursework, including a clear report structure, the number of digits in tables (e.g., 2 decimals for all numbers throughout the report), detailed descriptions for tables/graphs, consistent reference styles, etc.
5. Your understanding of, and ability to interpret, the software-generated output in terms of the concepts and ideas discussed in the lectures and classes.
6.
You will NOT be assessed in terms of how well your model happens to fit the data, or whether you find particular variables are significant.
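Returning to the pre-processing step in Q2(b): a common reading of GKX-style rank normalization is to map each characteristic's cross-sectional rank within each month into the interval [-1, 1]. The sketch below is a pure-Python illustration under that assumption (check it against footnote 29 of the paper; ties here are broken by input order, and in practice you would apply it per month, e.g. via a pandas groupby over the date column):

```python
def rank_normalize(values):
    """Map one month's cross-section of a characteristic onto (-1, 1].

    Each value is replaced by 2 * rank / n - 1, where rank runs from 1
    (smallest) to n (largest). Ties are broken by input order here; a
    production version would average tied ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices by value
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return [2.0 * r / n - 1.0 for r in ranks]
```

For example, a cross-section of three betas [0.5, 1.0, 2.0] becomes ranks 1, 2, 3 and normalized values of roughly -0.33, 0.33, and 1.0, so the largest observation always maps to 1 regardless of the raw scale.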
Unit Assessment Brief
MA Internet Equalities
Unit Title: Feminist Coding Practices
Unit code: IU000145
Unit credit: 20 credits
Unit duration: Weeks 2-11
Year / Level: 1/7
Unit briefing date: Thursday 3rd October 2024

Unit introduction
In this unit you will be introduced to free and open-source culture and software, considering how feminist approaches can frame the practice of coding (such as generative code and feminist chatbots). This unit has the explicit aim of acquiring basic coding skills within a community of practitioners and ensuring you develop a foundation to tackle the rest of the course and orient your coding skills towards ethical technology development. Please read the Learning Materials that accompany this document. These may include project briefs, unit guidelines, a glossary, additional reading lists, or event and presentation information. This information will be published together on Moodle.

Learning outcomes and assessment criteria
On completion of this unit you will be able to:
LO1 Integrate and deploy algorithms using web technologies, databases and networks (Enquiry)
LO2 Research and implement emerging practices around inclusion and community approaches to computation (Process)
LO3 Critically discuss issues around computational practice and representation (Knowledge)
LO4 Discuss and present creative work within the context of feminist computational practices (Realisation)

Assessment Criteria
Your work in this unit will be marked against the UAL assessment criteria, which are designed to give you clear feedback on your achievement. The full assessment criteria descriptions can be found on the UAL Assessment webpage.

What you have to produce
You will submit a portfolio project as directed by the unit brief. This unit is assessed holistically (100% of the unit). There is no breakdown of grades for the different assessment evidence listed.
Instead, the evidence is considered together and the tutors use their academic judgment to arrive at a grade for the unit as a whole. All components of the assessment must be submitted to pass the unit.
CAN201 Introduction to Networking: Networking Project

Contribution to Overall Marks: 40%
Submission Deadline of Part I: 15/Nov/2024 23:59
Submission Deadline of Part II:
Type: Team coursework
Learning Outcomes: [A] [B] [C] [D]

How should the work be submitted?
• SOFT COPY ONLY!
• Every team leader must submit the work through Learning Mall.

Specification of Part I (20% of overall marks)

File uploading and downloading are among the most important network-based applications in our daily life. This part of the networking project aims to use Python socket programming to implement a client-side application for file uploading and downloading based on a given protocol. The examiner will define and release the protocol description and the server-side application on Learning Mall. However, the released server-side code might have some syntax bugs. You should first fix all the bugs and run the server-side code. Then, you should implement the client-side application in Python and test your code against the server-side application. The details are as follows:

Task 1: Server code debugging and setup
Fix the existing syntax bugs in the code for the server-side application and run the server-side code. "Server is ready" should be displayed in the terminal window when all the bugs are fixed.

Task 2: Get authorization from the server side
Your application should log in to the server using the rule defined in the protocol. A "Token" will be returned if you log in successfully. This token will be used for all the following tasks in this part.

Task 3: Upload a file to the server
To upload a file, your application should first apply for the upload operation using the required information about the file. An upload plan will be returned, which includes the "key" for permission, the block size for this upload, and the total number of blocks. Your application should upload the file block by block until the whole file is uploaded.
Then, you should check the status of the file on the server according to the protocol. The MD5 of the file will be included in the status, which can be used to check whether your file has been received by the server properly.

Submission:
Codes:
• Python >= 3.6;
• The two Python program files, i.e., "server.py" and "client.py".
Project Report:
• A cover page with your full names (pinyin for Chinese students; the name on your passport for international students) and the student IDs of the whole team.
• 3-5 pages (including everything such as the references, excluding the cover page and appendix), single column, 1.25x line spacing, 2.54 cm margins, serif font, font size 11pt.
• PDF format; LaTeX is recommended.
• Including:
- Abstract
- Introduction: project task specification (introduce some background; do not copy from this document, and use your own words), challenge (identify the research/development problems you are going to address), practical relevance (come up with potential applications of your proposal), and contributions (key points of what you did for this coursework).
- Related Work: research papers, technical reports, or similar applications that solve or facilitate network traffic redirection.
- Design: the design of your solution, including a C/S network architecture diagram (which you need to describe in your own words), the workflow of your solution (in particular, the steps of performing authorization, fetching the token, and uploading the file), and the algorithm (i.e., the kernel pseudocode of the authorization and file uploading).
- Implementation: the host environment where you developed the implementation, such as the host CPU, memory, operating system, etc.; the development software or tools, like the IDE and the Python libraries; the steps of implementation (e.g., program flow charts) and the programming skills (OOP, parallelism, etc.) you used; and the actual implementation of the authorization and file uploading functions.
In addition, describe the difficulties you met and how you solved them.
- Testing and Results: the testing environment (which can be more or less the same as your host implementation environment), the testing steps (the steps of using the developed Python programs to complete project tasks 1-3, including snapshots), and the testing results, i.e., the time used for uploading the whole file; use bar or curve figures to show average performance.
- Conclusion: what you did for this project and any future work for improvement.
- Acknowledgment: individual contribution percentages should be clarified here if the project is teamwork, using this format: Student1's name (ID) contributes XX% to the project, Student2's name (ID) contributes XX% to the project, Student3's name (ID) contributes XX% to the project, Student4's name (ID) contributes XX% to the project, and Student5's name (ID) contributes XX% to the project. If there is no clarification of individual contributions, it is assumed that all team members contribute the same percentage to the project.
- References [IEEE format]

Meanwhile, you have to follow these compulsory requirements (no tolerance):
• Only a ZIP file is allowed to be submitted;
• The ZIP file should be named: CAN201-CW-Part-I-Student1name-Student2name-Student3name-Student4name-Student5name;
• The ZIP file includes two folders, i.e., "Codes" and "Report". The Codes folder includes all the Python files, and the Report folder includes the report file;
• The Python files are: server.py, client.py;
• The report file should be named: Report_Part_I.pdf.

Allowed Python modules: os, sys, shutil, socket, struct, hashlib, math, tqdm, numpy, threading, multiprocessing, gzip, zlib, zipfile, time, argparse, json, logging, and other approved modules (send me an email for approval).

Marking Criteria
The following marking scheme is for the team, and every team member shall contribute to the project. Also, several specific rules should be followed:
1.
Every team should use the "ACKNOWLEDGMENT" section of the IEEE template to describe the individual contribution(s) using the following format: Student1's name (ID) contributes XX% to the project, Student2's name (ID) contributes XX% to the project, Student3's name (ID) contributes XX% to the project, Student4's name (ID) contributes XX% to the project, and Student5's name (ID) contributes XX% to the project.
2. If there is no clarification of the individual contributions, it is assumed that every member of the same team has the same contribution percentage and will receive the same mark for the CW project.
3. The individual contribution must be in a range: for a 5-person team, it must be 10%-30% (10% and 30% included); for a 4-person team, it must be 15%-35% (15% and 35% included). If any individual contribution percentage of a team is out of range (e.g., a 5-person team has contributions like 60%, 10%, 10%, 10%, 10%), the team may go through a review by the module leader about the contribution discrepancy.
4. The algorithm for calculating individual marks is as follows:
a. Assume the 5-person team's mark is m, and students 1-5 contribute x%, y%, z%, u%, and v%, respectively.
b. The student with the largest contribution will get mark m.
c. Student 1's mark will be x/max(x,y,z,u,v)*m.
d. Student 2's mark will be y/max(x,y,z,u,v)*m.
e. Student 3's mark will be z/max(x,y,z,u,v)*m.
f. Student 4's mark will be u/max(x,y,z,u,v)*m.
g. Student 5's mark will be v/max(x,y,z,u,v)*m.

Report (50% of Part I) Marking Criteria

    Item                                    Mark
    Contents (40%)
      Abstract                              3%
      Introduction                          5%
      Related Work                          4%
      Design                                8%
      Implementation                        7%
      Testing and Results                   7%
      Conclusion                            3%
      Reference                             3%
    Typography (5%)
      Report structure, style, and format   5%
    Writing (5%)
      Language                              5%

Marking Scheme:
1. Contents (40%)
1.1. Abstract (3%) - Good (3%) - Appropriate (1%-2%) - No abstract (0%)
1.2.
Introduction (5%)
- Excellent (5%)
- Lacks necessary parts (1%-4%)
- No introduction (0%)
1.3. Related Work (4%)
- Sufficient (4%)
- Not enough (1%-3%)
- No related work (0%)
1.4. Design (8%)
- Excellent: adequate and accurate figures and text description (8%)
- Reasonable: clear figures and text description (4%-7%)
- Incomplete: unclear figures and text description (1%-3%)
- No design (0%)
1.5. Implementation (7%)
- Excellent: sufficient details of implementation (7%)
- Reasonable: clear description of implementation (4%-6%)
- Incomplete: unclear description of implementation (1%-3%)
- No implementation (0%)
1.6. Testing and Results (7%)
- Excellent: sufficient testing description, correct experimental results using figures with clear text description and analysis (7%)
- Acceptable: clear testing description, appropriate experimental results using figures with acceptable text description and analysis (3%-6%)
- Incomplete: lacking testing description, experimental results with figures, or text description and analysis (1%-2%)
- No testing and results (0%)
1.7. Conclusion (3%)
- Excellent conclusion (3%)
- Acceptable conclusion (1%-2%)
- No conclusion (0%)
1.8. Reference (3%)
- Excellent references in the correct IEEE format (3%)
- Incorrect or inconsistent reference format (1%-2%)
- No references (0%)
2. Typography (5%)
- Excellent and clear typography (5%)
- Acceptable typography (2%-4%)
- Bad typography (0%-1%)
3. Writing (5%)
- Accurate and concise language (3%-5%)
- Unclear and confusing language (1%-2%)
Codes (50% of Part I)
Program testing steps:
1. Debug the server code: run python3 server.py and "Server is ready" should be displayed in the terminal window.
2. Authorization: run python3 client.py --server_ip xxx.xxx.xxx --id xxxxxx --f and "Token: {token}" should be displayed in the terminal window.
3. File uploading: after step 2, your code should upload the file indicated on the command line (around 10 MB) to the server.
The progress should be printed in the terminal window.
Marking Scheme:
Step 1 (15%)
- No bugs, runs properly (15%)
- 2% for each bug fixed (2%-12%)
- No bug fixed (0%)
Step 2 (15%)
- Prints the right "Token" (15%)
- Prints something, but not the right "Token" (6%-14%; step 3 will not be tested)
- Prints nothing but can run (1%-5%; step 3 will not be tested)
- Cannot run (0%; step 3 will not be tested)
Step 3 (20%)
- The server received the right file (checked by MD5) (20%)
- The server received a file, but not the right file (10%-19%)
- The server received nothing (0%-9%)
BU.510.650 Data Analytics, Assignment #3
Attention: Please prepare two files for each homework assignment: a .docx or .pdf file with your answers, including figures, for each question, and a .R file with your R script. File names should be "LastName FirstName studentID.docx" and "LastName FirstName studentID.R" for assignment 3. All assignments should be submitted via Canvas.
1. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ϵ of length n = 100. (Both X and ϵ follow the standard normal distribution. Use set.seed(100) and set.seed(200) when generating X and ϵ, respectively.)
(b) Generate a response vector Y of length n = 100 according to the model Y = β₀ + β₁X + β₂X² + β₃X³ + ϵ, where β₀ = 3, β₁ = 2, β₂ = 1, β₃ = 0.5 are constants.
(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X², ..., X¹⁰. What is the best model obtained according to Cp, BIC, and adjusted R²? Show some plots to provide evidence for your answer, and report the coefficients of the best models obtained. Hint: you can use regsubsets(Y~poly(X,10,raw=TRUE),data=data.frame(X,Y)) to perform best subset selection with predictors X, X², ..., X¹⁰.
(d) Repeat (c) using forward stepwise selection and also using backward stepwise selection. How do your answers compare to the results in (c)?
2. In this exercise, we will perform subset selection using the Boston data set. Note that the response variable is medv.
(a) Perform best subset selection with three predictors. What are the best three predictors?
(b) Perform linear regression with the three best predictors. Is your model significant? How much variability can be explained by this linear model?
(c) Perform forward and backward stepwise subset selection. Find the models with seven predictors.
Are they the same as those from best subset selection? If different, does stepwise subset selection lose much in terms of the proportion of variability explained by the model?
Asset Pricing Theory – Assignment (in pairs)
Due date: 11:59 pm, November 15th, 2024
1 CAPM in a CARA-Normal Setup
Consider an economy with two dates, t = 0 and t = 1.
Financial markets:
● At t = 0, K risky securities can be traded at prices p_k, as well as a risk-free security at price p_f.
● At t = 1, risky security k pays off a random amount per share equal to ã_k ~ N(μ_k, σ_k²), with Var[ã] the variance-covariance matrix, which is invertible by assumption. The risk-free security always pays 1 per share.
● Denote p′ = (p_1, ..., p_K) the vector of prices, p_f the price of the risk-free security, and ã′ = (ã_1, ..., ã_K) the vector of random payoffs.
Investors:
● I investors can trade on the financial markets.
● At t = 0, investor i ∈ {1, ..., I} already holds a portfolio of risky securities z_i(0) = (z_1^i(0), ..., z_K^i(0)) (no risk-free security).
● Each investor will trade to obtain a new portfolio of risky securities, (z_i)′ = (z_1^i, ..., z_K^i), and a risk-free security position z_f^i.
● At t = 1, on top of the portfolio's payoff, investor i receives a random wage ξ_i.
● All investors agree about the payoff distributions.
● For a risky payoff c̃_i at t = 1, investor i derives at t = 0 an expected utility, E[u_i(c̃_i)], with u_i(x) = −τ_i exp(−x/τ_i) (τ_i = risk tolerance = 1/risk aversion).
Questions:
1. For given prices, compute the initial wealth of investor i.
2. Define the aggregate supply of security k, S_k, for all k ∈ {1, ..., K}.
3. Give the market equilibrium conditions for all financial markets (risk-free security included).
4. Give investor i's budget constraint at t = 0.
5. Show that investor i's objective function is equivalent to the mean-variance criterion E[c̃_i] − Var[c̃_i]/(2τ_i).
6. Determine investor i's demand function z_i(p).
7. Determine the aggregate demand function z(p) (we will denote τ = Σ_i τ_i).
8. Determine the equilibrium vector of prices, p*.
9. Consider the case where there is no random wage (ξ_i = 0 for all i).
The objective is to study the risk premia of the risky securities.
● Denote Z_M = (S_1, S_2, ..., S_K) the market portfolio, p*_M = (p*)′Z_M its price, and ã_M = ã′Z_M its payoff.
(i) Write the risk premium of security k as a function of the covariance between its payoff and the market portfolio's payoff. Write the relation between the market portfolio's risk premium and the variance of its payoff, Var[ã_M]. Write the relation between the risk premium of security k and the market portfolio's risk premium.
● Let r̃_k = ã_k/p_k − 1 and r̃_M = ã_M/p*_M − 1 be the returns of security k and of the market portfolio. Write the risk premium of security k, E[r̃_k] − r_f, as a function of the market portfolio's risk premium, E[r̃_M] − r_f.
2 Arrow-Debreu Securities in a CARA-Normal Setup
Consider an economy with two dates, t = 0 and t = 1. In this economy there is a risky security that pays off a random amount per share ã ~ N(μ, σ²) at t = 1. The asset's aggregate supply is Q. At t = 0, investors pick their portfolios. A portfolio is made of the risky asset and the risk-free asset, which pays off 1 per share at t = 1. The risk-free rate is r_f. Investor i derives some expected utility from her final wealth ỹ_i at t = 1, computed as E[u_i(ỹ_i)], where u_i(x) = −τ_i exp(−x/τ_i). We will denote τ = Σ_i τ_i.
1. Show that the equilibrium price of the risky asset is p* = (μ − σ²Q/τ)/(1 + r_f).
From now on, we reconsider the problem via an approach à la Arrow-Debreu. We assume that at t = 1 there is an infinite number of states of the world. Each state is indexed by the realization of the random variable ã. State of the world a occurs with probability φ(a)da, where φ is the N(μ, σ²) density. In state a, each agent i is endowed with wealth (or consumption units) w_i(a). The Arrow-Debreu security indexed by a pays off one unit of wealth per share in state a, and zero otherwise. Its (infinitesimal) price is denoted q(a)da. We will take the Arrow-Debreu security indexed by a = 0 as the numéraire, that is, q(0) = 1. The function q is assumed integrable.
2. Explain why agent i's Lagrangian L_i to be optimized is her expected utility plus a multiplier times her budget constraint.
3.
For an integrable function f, we can define a derivative with respect to the value f(a) (a functional derivative). Show that the first-order conditions of the maximization program max_{y_i(a), a∈R} L_i imply the condition stated in the problem.
4. Show that the equilibrium prices q(a) are as stated.
5. Show that the risk-free rate r_f is defined as stated.
6. Finally, show that a security that pays off ã per share at t = 1 has a price equal to the equilibrium price found in question 1.
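The mean-variance equivalence in question 5 of Section 1 rests on one standard fact about CARA utility with normal payoffs; a sketch of the computation (my derivation, using the moment-generating function of a normal random variable, under the handout's notation):

```latex
% If \tilde c_i is normally distributed, the normal MGF gives
\mathbb{E}\!\left[-\tau_i\, e^{-\tilde c_i/\tau_i}\right]
  = -\tau_i \exp\!\left(-\frac{\mathbb{E}[\tilde c_i]}{\tau_i}
      + \frac{\operatorname{Var}[\tilde c_i]}{2\tau_i^{2}}\right)
  = -\tau_i \exp\!\left(-\frac{1}{\tau_i}
      \left(\mathbb{E}[\tilde c_i]
        - \frac{\operatorname{Var}[\tilde c_i]}{2\tau_i}\right)\right).
% Since x \mapsto -\tau_i e^{-x/\tau_i} is strictly increasing, maximizing
% expected utility is equivalent to maximizing the mean-variance criterion
\mathbb{E}[\tilde c_i] \;-\; \frac{\operatorname{Var}[\tilde c_i]}{2\tau_i}.
```

The same step, applied with a single asset and aggregated across investors, is what delivers the CARA-normal equilibrium price in Section 2, question 1.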
UPPP 146, F24 Problem Set 3: Business Case Analysis using $ and emissions
Cost-benefit analysis (CBA) is a key part of business and government project management. It can be used to assess the potential of an investment or to choose among various alternatives, by both firms and governments. A "value" is calculated in monetary terms. The goal is to determine the cost effectiveness and economic feasibility of a given decision. Firms look at the costs and benefits that affect them, while governments may consider the pros and cons for many people and firms. CBA focuses on money in and money out; environmental effects are secondary. However, current thought leaders believe this is short-sighted given the externality impacts of nearly every process. Taking this further, materials and processes do not magically appear for the firm's use but have been acquired and transformed from natural resources; each step in the supply chain produces externalities. Supply chains today are complex networks with numerous suppliers and service organizations who have touched and transported each element and component. To study all aspects of a product's or service's "life," researchers and practitioners use life cycle analysis (LCA) to focus on inputs and outputs (including waste and toxins). A high-level diagram of the life cycle for any commodity is shown in Figure 1.
Figure 1. From https://www.greenelement.co.uk/blog/a-short-guide-to-product-life-cycle-analyses/cradle-to-grave/
This problem set will introduce you to elements of a CBA, life cycle analysis, and emission accounting using environmental input-output (EIO) tables, a method based on economic input-output analysis, a macroeconomic tool. Input-output models demonstrate how commodities and industries are interdependent. It is a supply-side tool (not demand-side). For an example of how input-output models work, see Appendix 1. The benefits portion of the case will not be covered; however, you will be asked to consider what benefits should be included.
An Excel file has been created with much of the data pre-populated for you. You will need to fill in the remaining pieces on the summary spreadsheet and make a recommendation on the best alternative. Other questions to answer are highlighted in yellow below (to make sure you do not miss them). To complete this exercise, I recommend following this process:
A. Read the case study below thoroughly.
a. Costs are explicitly listed below and populated in the Excel spreadsheet "summary." Make sure you understand how the 3 alternatives have been structured in Excel. I recommend tracing each element in the project description to the summary sheet columns B-G, including formulas if it is not a direct transfer. For example, Alternative 1 says 550,000 gallons of diesel fuel are expected to be used in this scenario per year. Note (1) has the wholesale price of fuel (in 2021). Cell D5 = 1.288083*550000/1000000*30, which calculates the cost of diesel fuel over 30 years in millions of dollars.
b. Add data for alternative 3 in column D.
i. D19 & D20: Diesel fuel cost is $1.288083/gallon and gasoline is $1.305/gallon (wholesale).
c. Columns F and G are tailpipe emissions over 30 years (the box second from the right in Figure 1). The other phases (cradle-to-gate and end-of-life) are missing from the case study. Cradle-to-gate emissions will be calculated in step C. (End-of-life is not part of this case, but in practice it should be.)
B. Read about EIO-LCA.
a. Read the article "Life Cycle Assessment (LCA) – Everything you need to know – Ecochain".
b. "The input-output model is a quantitative economic technique that represents the flow of goods and services within an economy, illustrating how industries interact through their inputs and outputs.
It captures the relationships between different sectors, helping to analyze how changes in one industry can affect others, making it essential for understanding economic interdependencies and optimizing resource allocation." (https://library.fiveable.me/key-terms/introduction-to-mathematical-economics/input-output-model) What this means is that there are many more interactions than we normally associate with the supply-chain network for a specific product or service. Input-output analysis attempts to identify these relationships.
c. Read the pdf "EIO-LCA Overview" by Carnegie Mellon.
i. Note that the approach/tool discussed covers only cradle-to-gate.
d. Answer these questions:
i. Why is LCA an important economic tool to address climate change?
ii. How can you see this method being used to improve products and processes?
C. Calculate the metric tons of CO2-equivalent units and particulate matter (PM) using EIO-LCA tables.
a. On the summary table, most of the data has been pre-populated for you, except concrete (in orange).
b. Analyze the Excel spreadsheet called "eiolca concrete output." This is a data download from eiolca.net from Carnegie Mellon for $100M of concrete.
i. It includes the industry sectors that are involved with making and delivering concrete. It is sorted by total economic value. Notice the ordering of the sectors.
1. What makes up the top 10 industry sectors for CO2 (CO2e)? List them with the amount of CO2e in metric tons.
2. What makes up the top 10 industry sectors for PM10? List them with the amount of PM10 in metric tons.
3. Do these categories surprise you? Why or why not?
c. On the summary page, use the data from the eiolca concrete output sheet (line 3) to populate cells H14, I14, H23, and I23. Hint: the data on the eiolca sheet is for $100M of concrete.
D. Complete the emissions section of the summary table.
a. Column J: calculate the total CO2e from tailpipe + cradle-to-gate emissions (in MT).
b.
Column L: calculate the total PM from tailpipe + cradle-to-gate emissions (in MT). Although not technically correct, you may add PM2.5 and PM10 together. (PM2.5 is part of PM10, but not the other way around.)
c. (Columns K & M will automatically populate based on formulas in those cells. Likewise, the totals will also be shown.)
E. Calculate health costs for each alternative.
a. Use the data in column F on the sheet "health costs." The health costs of one metric ton of emissions are in $(000)/yr. Reference the information provided in the notes section of this table (column H) to ensure you are using the correct $. (A copy of the PM output data from https://cobra.epa.gov/ is shown in Appendix 2.)
b. CO2e costs have been pre-populated for you. Data on the source is included in the sheet for your information.
c. What are the top 3 health cost categories shown in Cobra?
F. Analyze the results and prepare a brief recommendation addressing the following:
a. Where are the major impacts?
b. What are the initial indications for a preferable alternative?
c. What are the limitations or assumptions of your analysis, and what further research may be required to guide this decision?
d. What additional information would improve your analysis?
e. What benefits would you propose including?
Rubric – points possible: 50
LCA
- 10 points: answers both questions fully
- Partial credit: answers 1 question fully, or both questions poorly
- 0 points: does not answer the questions in a way that demonstrates having read the article
eiolca concrete output analysis (industry sectors; Cobra health data)
- 10 points: complete response (all 4 questions addressed and correct)
- 7.5 points: partial response (3 questions addressed/correct)
- 5 points: partial response (2 questions addressed/correct)
- 0 points: no response, or answers are incorrect
Summary sheet
- 10 points: all empty cells filled in; may have minor mistakes
- 7.5 points: approx. 75% of cells done, few mistakes
- 5 points: approx. 50% of cells done, few mistakes
- 0 points: less than 50% done, major mistakes
Alternative recommendation
- 10 points: strong recommendation, thoughtful responses to the other questions in section F
- 7.5 points: strong recommendation, but responses to section F questions are not complete
- 5 points: weak recommendation; responses to other questions in section F are missing or not complete
- 0 points: no response
Grammar/spelling
- 10 points: no mistakes
- 7.5 points: 2 mistakes
- 5 points: more than 2 mistakes
- 0 points: more than 5 mistakes
Case study:
CalTrans and the Metropolitan Transportation Authority (MTA) are seeking advice on a possible major construction project for the 710 Freeway designed to speed the flow of freight and minimize traffic tie-ups. They are proposing three alternatives:
1. Maintain the status quo of shared lanes for cars and trucks.
2. A 15-mile overhead for truck-only lanes using standard concrete.
3. A 15-mile overhead for truck-only lanes using a long-lasting concrete.
Use the hypothetical data points provided below (as produced by consultants preparing environmental documents for the three alternatives) as input for your analysis. Using the Excel file included, determine the total cost, carbon footprint, and changes in respiratory inorganics for the three alternatives over an initial 30-year period.
Alternative 1: Maintain the status quo of shared lanes for cars and trucks
· Rotating highway repair every 5 years to cover all 15 miles of shared lanes (3 miles per year), with a first-year budget of 25 million dollars for 3 miles of road repair
· Estimate of 550,000 gallons of diesel fuel consumption per year (trucks)
· Estimate of 750,000 gallons of gasoline consumption per year (cars)
· Estimated truck and auto tailpipe emissions over 30 years: 210,000 metric tons of CO2-eq and 249 metric tons of PM2.5-eq respiratory inorganics
Alternative 2: 15-mile overhead for truck-only lanes using standard concrete
· Estimated design cost is 250 million dollars
· Estimated construction cost is 500 million dollars (inclusive of all material and labor except concrete)
· Concrete cost is 100 million dollars
· Rotating highway repair for vehicle-only lanes is 30 years to cover all 15 miles (0.5 miles per year), with a first-year budget of 5 million dollars for 0.5 miles of road repair
· Rotating highway repair of truck-only lanes is 5 years to cover all 15 miles of truck lanes (3 miles per year), with a first-year budget of 25 million dollars for 3 miles of road repair
· Estimate of 238,956 gallons of diesel fuel consumption per year (trucks)
· Estimate of 525,000 gallons of gasoline consumption per year (cars)
· Estimated truck and auto tailpipe emissions over 30 years: 149,000 metric tons of CO2-eq and 119 metric tons of PM2.5-eq respiratory inorganics
Alternative 3: 15-mile overhead for truck-only lanes using long-lasting concrete
· Estimated design cost is 250 million dollars
· Estimated construction cost is 500 million dollars (inclusive of all material and labor except concrete)
· Concrete cost is 400 million dollars
· Rotating highway repair for vehicle-only lanes is 30 years to cover all 15 miles (0.5 miles per year), with a first-year budget of 5 million dollars for 0.5 miles of road repair
· Rotating highway repair of truck-only lanes is 20 years to cover all 15 miles of truck lanes (0.75 miles per year), with a first-year budget of 5 million dollars for 0.75 miles of road repair
· Estimate of 238,956 gallons of diesel fuel consumption per year (trucks)
· Estimate of 525,000 gallons of gasoline consumption per year (cars)
· Estimated truck and auto tailpipe emissions over 30 years: 149,000 metric tons of CO2-eq and 119 metric tons of PM2.5-eq respiratory inorganics
Appendix 1. Economic input-output analysis example
From https://www.wallstreetmojo.com/input-output-analysis/ dated 8/21/24 by Kosha Mehta
"Suppose a local government wants to construct a new bridge and must justify the investment cost. It recruited Sam, an economist, to carry out an input-output analysis. The economist interacted with construction firms and engineers to predict the cost of the bridge, the total number of workers required, and the supplies necessary. Sam converted the details into dollars and ran the numbers through an input-output model, producing three impact levels. The direct effect is the original numbers put into that model, for instance, the raw inputs' value. The indirect impact refers to the jobs that supply companies (cement and steel organizations) generate. Such organizations must hire a workforce to complete the entire project. They may already have the required funds. Alternatively, they may borrow the money. This has another impact on the banks.
The induced impact refers to the money new workers spend on services and products for their families and themselves. This covers the basics, for example, clothing and food. However, since they now possess more disposable income, this is also associated with the products and services utilized for enjoyment. The input-output model observes the ripple effects on the economy's multiple sectors due to the local government building a new bridge. The government may need to bear specific costs for the bridge and utilize taxes. That said, the analysis will help understand the benefits generated by the project by recruiting companies that hire a workforce that spends in the economy, thus helping it expand."
Appendix 2. Cradle-to-gate diesel & gas health cost for 1 ton (not metric ton) of PM. From https://cobra.epa.gov/
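Two bits of spreadsheet arithmetic recur throughout this case: the 30-year fuel cost in millions of dollars (the case's cell D5 is 1.288083*550000/1000000*30), and scaling the EIO-LCA cradle-to-gate figures, which are reported per $100M of concrete, to each alternative's actual concrete spend. A small sketch of both, with function names of my own choosing:

```python
# Helper arithmetic mirroring the case's summary-sheet conventions.
def fuel_cost_millions(price_per_gallon, gallons_per_year, years=30):
    """Total fuel cost over the horizon, in millions of dollars
    (price in $/gal times gallons/yr, scaled to $M, times years)."""
    return price_per_gallon * gallons_per_year / 1_000_000 * years

def scale_eiolca(metric_tons_per_100m, concrete_cost_millions):
    """Scale emissions reported for $100M of sector output to the
    actual spend, e.g. Alternative 3's $400M of concrete."""
    return metric_tons_per_100m * concrete_cost_millions / 100

# Alternative 1 diesel, matching cell D5: about $21.25M over 30 years
alt1_diesel = fuel_cost_millions(1.288083, 550_000)
```

The linear scaling is the standard EIO-LCA assumption: the model is linear in final demand, so emissions for $400M of concrete are simply four times the $100M figures.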
BIO2101 Comprehensive Biology Laboratory
Exercise 5: Polymerase chain reaction (PCR) replication of the β-galactosidase gene LacZ and gel electrophoresis verification
Purpose:
1. Understand the principle of DNA fragment replication via PCR and practice it.
2. Perform DNA gel electrophoresis to identify the base-pair number of a given DNA fragment.
I. Introduction:
Part 1: polymerase chain reaction (PCR) replication of LacZ
In the biological sciences there have been technological advances that catapult the discipline into golden ages of discovery. The development of the polymerase chain reaction (PCR) is one of those innovations that changed the course of molecular science, with its impact spanning countless subdisciplines in biology. The theoretical process was outlined by Kleppe and coworkers in 1971; however, it was another 14 years until the complete PCR procedure was described and experimentally applied by Kary Mullis while at Cetus Corporation in 1985. Automation and refinement of this technique progressed with the introduction of a thermostable DNA polymerase from the bacterium Thermus aquaticus, hence the name Taq DNA polymerase. PCR is a powerful amplification technique that can generate an ample supply of a specific segment of DNA (i.e., an amplicon) from only a small amount of starting material (i.e., DNA template or target sequence). A standard polymerase chain reaction (PCR) setup consists of the following steps:
1. Designing Primers
Designing appropriate primers is essential to the successful outcome of a PCR experiment. When designing a set of primers for a specific region of DNA desired for amplification, one primer should anneal to the plus strand, which by convention is oriented in the 5' → 3' direction (also known as the sense or nontemplate strand), and the other primer should complement the minus strand, which is oriented in the 3' → 5' direction (antisense or template strand).
Below is a list of characteristics that should be considered when designing primers.
1) Primer length should be 15-30 nucleotide residues (bases).
2) Optimal G-C content should range between 40-60%.
3) The 3' end of primers should contain a G or C in order to clamp the primer and prevent "breathing" of the ends, increasing priming efficiency. DNA "breathing" occurs when ends do not stay annealed but fray or split apart. The three hydrogen bonds in GC pairs help prevent breathing but also increase the melting temperature of the primers.
4) The 3' ends of a primer set, which includes a plus-strand primer and a minus-strand primer, should not be complementary to each other, nor can the 3' end of a single primer be complementary to other sequences in the primer. These two scenarios result in the formation of primer dimers and hairpin loop structures, respectively.
5) Optimal melting temperatures (Tm) for primers range between 52-58 °C, although the range can be expanded to 45-65 °C. The Tm of the two primers should differ by no more than 5 °C.
6) Di-nucleotide repeats (e.g., GCGCGCGCGC or ATATATATAT) or single-base runs (e.g., AAAAA or CCCCC) should be avoided, as they can cause slipping along the primed segment of DNA and/or hairpin loop structures to form. If unavoidable due to the nature of the DNA template, then only include repeats or single-base runs with a maximum of 4 bases.
Notes:
- There are many computer programs designed to aid in designing primer pairs. The NCBI Primer design tool (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) and Primer3 (http://frodo.wi.mit.edu/primer3/) are recommended websites for this purpose.
- In order to avoid amplification of related pseudogenes or homologs, it can be useful to run a BLAST search on NCBI to check the target specificity of the primers.
2. Prepare materials and reagents.
Arrange all reagents needed for the PCR experiment in a freshly filled ice bucket, and let them thaw completely before setting up a reaction.
1) Standard PCR reagents include a set of appropriate primers for the desired target gene or DNA segment to be amplified, DNA polymerase, a buffer for the specific DNA polymerase, deoxynucleotides (dNTPs), DNA template, and sterile water.
2) Additional reagents may include magnesium salt Mg²⁺ (at a final concentration of 0.5 to 5.0 mM), potassium salt K⁺ (at a final concentration of 35 to 100 mM), dimethyl sulfoxide (DMSO; at a final concentration of 1-10%), formamide (at a final concentration of 1.25-10%), bovine serum albumin (at a final concentration of 10-100 μg/ml), and betaine (at a final concentration of 0.5 M to 2.5 M).
3. Setting up a Reaction Mixture
Template DNA (plasmid remaining in PCR tube)   ≈1 μl
Forward primer, 10 μM                           0.5 μl
Reverse primer, 10 μM                           0.5 μl
2x PCR Mix                                      10 μl
ddH2O                                           8 μl
Total volume                                    20 μl
4. PCR thermocycling
Step                   Temperature   Time
Initial denaturation   95 °C         2 min
35 cycles              95 °C         30 s
                       60 °C         30 s
                       72 °C         30 s / 1 kb
Final extension        72 °C         10 min
Hold                   4 °C          -
The PCR product can be stored at -20 °C for over one week.
Part 2: DNA electrophoresis in agarose gel
Gel electrophoresis is the standard lab procedure for separating DNA by size (e.g., length in base pairs) for visualization and purification. Electrophoresis uses an electrical field to move the negatively charged DNA through an agarose gel matrix toward a positive electrode. Shorter DNA fragments migrate through the gel more quickly than longer ones. Thus, you can determine the approximate length of a DNA fragment by running it on an agarose gel alongside a DNA ladder (a collection of DNA fragments of known lengths). A standard DNA agarose gel electrophoresis consists of the following steps:
1. Preparation of the Gel
Weigh out the appropriate mass of agarose into an Erlenmeyer flask. Agarose gels are prepared as a w/v percentage solution. The concentration of agarose in a gel depends on the sizes of the DNA fragments to be separated, with most gels ranging between 0.5%-2%.
The volume of the buffer should not be greater than 1/3 of the capacity of the flask. Add running buffer to the agarose-containing flask. Swirl to mix. The most common gel running buffers are TAE (40 mM Tris-acetate, 1 mM EDTA) and TBE (45 mM Tris-borate, 1 mM EDTA). Melt the agarose/buffer mixture. This is most commonly done by heating in a microwave, but it can also be done over a Bunsen flame. At 30 s intervals, remove the flask and swirl the contents to mix well. Repeat until the agarose has completely dissolved. Add DNA staining reagent (Gel Blue) to the mixture. Alternatively, the gel may also be stained after electrophoresis in running buffer containing staining reagent for 15-30 min, followed by destaining in running buffer for an equal length of time. Place the gel tray into the casting apparatus. Alternatively, one may also tape the open edges of a gel tray to create a mold. Place an appropriate comb into the gel mold to create the wells. Pour the molten agarose into the gel mold. Allow the agarose to set at room temperature. Remove the comb and place the gel in the gel box before use.
2. Setting up the Gel Apparatus and Separation of DNA Fragments
Add loading dye to the DNA samples to be separated. Gel loading dye is typically made at 6X concentration (0.25% bromophenol blue, 0.25% xylene cyanol, 30% glycerol). Loading dye helps to track how far your DNA sample has traveled, and also allows the sample to sink into the gel.
3. Observing Separated DNA Fragments
Program the power supply to the desired voltage (1-5 V/cm between electrodes). Place the gel in the gel box. The cathode (black lead) should be closer to the wells than the anode (red lead). Double-check that the electrodes are plugged into the correct slots in the power supply. Turn on the power. Run the gel until the dye has migrated an appropriate distance. Remove the gel from the gel tray and expose the gel to UV light. This is most commonly done using a gel documentation system.
DNA bands should show up as fluorescent bands. Take a picture of the gel.
II. Procedure: PCR of target gene and gel electrophoresis verification
1. Design PCR primers for gene replication.
LacZα-F: 5'- ATGACCATGATTACGCCAA -3'  (19 bp, Tm = 53 °C, 42% GC)
LacZα-R: 5'- CTATGCGGCATCAGAGCA -3'   (18 bp, Tm = 55 °C, 56% GC)
2. Gene replication via the PCR method
Prepare the PCR replication system in the PCR microtube with the remaining plasmid as follows:
Template DNA (plasmid remaining in PCR tube)   ≈1 μl
Transfer the PCR tubes from ice to a PCR machine and begin thermocycling as follows:
Step                   Temperature   Time
Initial denaturation   95 °C         2 min
35 cycles              95 °C         30 s
                       60 °C         30 s
                       72 °C         30 s / 1 kb
Final extension        72 °C         10 min
Hold                   4 °C          -
Note: Step 1 was done by the TA. Step 2 was done last week. Samples were stored at -20 °C for one week.
3. Preparation of the Gel
Weigh out 0.6 g agarose into an Erlenmeyer flask and add 60 ml TAE buffer. Swirl to mix. Melt the agarose/buffer mixture by heating in a microwave. At 30 s intervals, remove the flask and swirl the contents to mix well. Repeat until the agarose has completely dissolved.
4. Add 6 µL DNA staining reagent (Gel Blue) to the mixture and mix well by gentle shaking. Place the gel tray into the casting apparatus. Place an appropriate comb into the gel mold to create the wells. Pour the molten agarose into the gel mold. Allow the agarose to set at room temperature (about 20 minutes).
5. Add 2 µL of 6x DNA loading dye to the 10 µL PCR system prepared last week. Mix well by gentle pipetting.
6. Remove the comb and place the gel in the gel box containing TAE buffer before use. Add 2.5 µL of ladder and 10 µL of DNA sample into the wells.
7. Place the gel in the gel box. The cathode (black lead) should be closer to the wells than the anode (red lead). Double-check that the electrodes are plugged into the correct slots in the power supply.
8. Turn on the power (160 V). Run the gel until the dye has migrated an appropriate distance (about 15 minutes).
9.
Gel imaging under UV exposure; DNA fragment analysis.
Experimental Datasheet of Exercise 5
Include here a picture of the PCR sample gel electrophoresis result, labeling your sample as well as the DNA marker bands of different molecular weights. Try to analyze every band of your PCR sample.
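The GC-content figures quoted for the two LacZα primers can be checked with a few lines of code. A sketch (my own helper names): GC% is just the G+C fraction, and the Wallace rule Tm = 2(A+T) + 4(G+C) gives only a rough estimate, so it lands near, not exactly on, the quoted 53 °C and 55 °C values; real primer-design tools such as Primer-BLAST use nearest-neighbor thermodynamic models instead.

```python
# Sanity-check the primer statistics quoted in the procedure.
def gc_percent(seq):
    """G+C bases as a percentage of primer length."""
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq):
    """Rough melting temperature via the Wallace rule: 2(A+T) + 4(G+C).
    Only an approximation; not the model used by the quoted Tm values."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 2 * (len(seq) - gc) + 4 * gc

lacz_f = "ATGACCATGATTACGCCAA"   # LacZα-F, 19 bp, quoted as 42% GC
lacz_r = "CTATGCGGCATCAGAGCA"    # LacZα-R, 18 bp, quoted as 56% GC
```

Both primers also satisfy the design rules above: 15-30 bases, 40-60% GC, and Tm values within 5 °C of each other.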
Department of Electrical and Computer Engineering
EECE-7204 Applied Probability & Stochastic Processes
FALL 2024 Homework-6 Problem Set
Due on November 16, 2024

P6.1 A random sequence Xn is defined by Xn = A sin(Ωn), n ≥ 0, in which A and Ω are discrete random variables described by their joint probability mass function (pmf). Determine: (a) the marginal density f_X3(x), (b) the joint density f_X1,X5(x, y), (c) the mean sequence μ_X(n), (d) the auto-correlation bi-sequence R_X(m, n), and (e) whether the sequence Xn is
• strict-sense stationary,
• wide-sense stationary,
• an independent random sequence,
• an uncorrelated random sequence, and
• an orthogonal random sequence.

P6.2 Stark/Woods Text Problem 8.10 (Page 528). For better clarity, this problem is modified as follows.
(a) The components of the random 10-vector X satisfy "non-causal" recursive equations, given by (1), with "initial" and "final" equations, respectively, given by (2). Using (1) and (2), arrange the above 10 scalar equations in vector form as

X = BX + CW   (3)

where B and C are 10 × 10 matrices of numbers, derived from (1) and (2), and W = {Wi} is a random 10-vector. Now solve for the vector X and arrange the solution as

X = AW, where A = (I − B)^(-1) C.   (4)

By carefully examining the numbers in matrix A, verify that the scalar equations for Xi are given by (5). You will need equation (5) to solve the following parts. What is the value of ρ in (5)?
(b) Determine the mean vector μ_X.
(c) Determine the ACVM K_X.
(d) Write an expression for the multidimensional pdf, f_X(x), of the random vector X.

P6.3 Stark/Woods Text Problem 8.21 (Page 531).

P6.4 Stark/Woods Text Problem 8.29 (Page 534).

P6.5 Let X[n] be an independent and identically distributed (IID) random sequence with PDF, at each n, given by f_X(x; n) = 2x[u(x) − u(x − 1)]. Another random sequence Y[n] is defined in terms of X[n].
(a) Determine μ_X[n] of X[n].
(b) Determine R_X[n1, n2] of X[n].
(c) Determine the mean sequence μ_Y[n], n ≥ 1, of Y[n].
(d) Determine the variance sequence σ²_Y[n], n ≥ 1, of Y[n].
(e) Let W = Y[n1] − Y[n2], n1, n2 ≥ 1, be a random variable. Determine the variance σ²_W of the random variable W.

P6.6 Stark/Woods Text Problem 8.48 (Page 539).

P6.7 Stark/Woods Text Problem 8.51 (Pages 539-540).

P6.8 [MATLAB Problem] In this problem you will generate M = 100 sample sequences of length N = 1000 of the random-walk process with step size S = 1 and analyze its statistical properties.
(a) First generate M independent sequences W[n] of length N + 1 in which each sample takes the value +S or −S with probability 1/2. Now perform a cumulative sum (i.e., use the MATLAB function cumsum) on each sample sequence to obtain M realizations of the random-walk process X[n] of length N. Plot all sample sequences in one figure. Submit a printout of your script and a printout of your plot.
(b) Estimate and plot the mean sequence μ_X[n] by averaging over the M realizations. How does it compare with the actual mean sequence given in the lecture notes? Submit a printout of your script and a printout of your plot.
(c) Estimate and plot the variance σ²_X[n] by averaging over the M realizations. How does it compare with the actual variance sequence given in the lecture notes? Submit a printout of your script and a printout of your plot.
(d) Comment on the stationarity of the random-walk process.
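The MATLAB parts of P6.8 must of course be submitted in MATLAB; purely as a language-neutral sketch of the same three steps (generate ±S steps, cumulative-sum them, then average across realizations), the following pure-Python fragment mirrors what the cumsum-based script does:

```python
import random

def random_walk_realizations(M, N, S=1, seed=0):
    """Generate M realizations of a random-walk process of length N.

    Each step is +S or -S with probability 1/2; the walk value is the
    running (cumulative) sum of the steps, i.e. the analogue of cumsum.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(M):
        x, walk = 0, []
        for _ in range(N):
            x += S if rng.random() < 0.5 else -S
            walk.append(x)
        walks.append(walk)
    return walks

def mean_and_variance(walks):
    """Estimate the mean and variance sequences by averaging across realizations."""
    M, N = len(walks), len(walks[0])
    mean = [sum(w[n] for w in walks) / M for n in range(N)]
    var = [sum((w[n] - mean[n]) ** 2 for w in walks) / M for n in range(N)]
    return mean, var

walks = random_walk_realizations(M=100, N=1000)
mean, var = mean_and_variance(walks)
# Theory: E[X[n]] = 0 and Var(X[n]) = n * S^2, i.e. the variance grows with n.
```

With S = 1 the theoretical sequences are μ_X[n] = 0 and σ²_X[n] = n·S², which is what the estimates in parts (b) and (c) should approach, and why part (d) concludes that the random walk is not stationary.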
Data Case Analysis (20 pts; word limit 8 pages, 12-point font with 1.5 spacing, including all components such as figures, graphs and tables)

Overview
The objective of the data case analysis is threefold. One is to provide you with an opportunity to demonstrate your ability to draw marketing insights from real-world data involving real business challenges. The second is to evaluate your ability to conduct critical marketing research, covering the entire marketing research process: data preparation, data analysis, and communication and presentation of the key findings. Note that we do not expect any advanced statistical tests for this data case analysis. Remember that you should be professional by achieving effective visual and written communication. The data case analysis should be no longer than 8 pages (Word document), including all components such as tables, figures, and references, although we do not penalize strictly on word count or page length.

Scenario
You are leading the headquarters marketing research team at TransUnion LLC. TransUnion is an American consumer credit reporting agency. TransUnion collects and aggregates information on over one billion individual consumers in over thirty countries, including "200 million files profiling nearly every credit-active consumer in the United States". Its customers include over 65,000 businesses. Based in Chicago, Illinois, TransUnion's 2014 revenue was US$3.83 billion.
It is one of the three largest credit agencies, along with Experian and Equifax (together known as the "Big Three").

Currently, your team is undertaking an important marketing research project which aims (1) to understand the state of customer dissatisfaction with credit reporting services in the United States, (2) to quantify and visualize the historical and geographical trends in consumer complaints, (3) to understand the state of consumer dissatisfaction towards the Big 3 credit reporting agencies, including your company (TransUnion) and the two major competitors (Experian and Equifax), and (4) to improve the customer experience and complaint management system at TransUnion.

Data
A dataset of consumer complaints about financial products and services is obtained from the Consumer Financial Protection Bureau (CFPB: https://www.consumerfinance.gov/), an agency of the United States government responsible for consumer protection in the financial sector. Each week the CFPB sends thousands of consumer complaints about financial products and services to companies for response. Data from those complaints help us understand the financial marketplace and protect consumers. For further information on the CFPB and its databases, you should explore their website (https://www.consumerfinance.gov/). The database includes general and descriptive information on about 500,000 consumer complaints and company responses concerning the Big 3 credit reporting companies in the U.S. The information available in the database can be found here. This dataset is enhanced further with US census data, which includes a number of socio-demographic variables at the county level (FIPS) from the SVI. A complete dictionary and description of the variables in the dataset is provided on Canvas (Data Dictionary). Note that you will be provided with a different dataset depending on your SID. Please check your SID and the last digit of your SID, and download and use the appropriate dataset.
Do not try to collaborate with other students, as this is an individual assignment involving different datasets. In addition, you are expected to bring different and unique angles to the dataset.

Key Deliverables in Data Case Analysis

Task 1 (10 pts): In the first task, you must provide "thick descriptions" of the state of customer dissatisfaction with financial products and services in the United States. For this task, you may consider answering the following questions, but you can also attempt to generate other insights about the state of customer dissatisfaction with credit reporting services in the United States.
o What is the state of customer dissatisfaction? Quantify and visualize the historical and geographical trends in complaint volume and complaint rate by location (e.g., ZIP, FIPS, or state) or by issue.
o How effectively have complaint cases been managed over time?
o Are there any event-related (e.g., Covid-19) or seasonal effects on the volume of complaints?
o Are there any socio-demographic factors influencing consumer complaints about credit reporting services?

Task 2 (10 pts): In the second task, you want to understand customer dissatisfaction with your company (TransUnion) and the major competitors (Experian and Equifax). In other words, you have to provide descriptive insights that can be translated into prescriptive insights for customer management and the complaint handling system. Note that Task 2 is a very open-ended question, meaning that you can communicate any strategic insights you find important and interesting. Below are some example questions you may try to address:
o How many complaints have TransUnion and its major competitors received? Where do such complaints come from? What is the historical trend in complaint volume by company, or before and after Covid-19?
o What key performance metrics would you use to evaluate complaint management performance?
Based upon these metrics, discuss and compare how effectively TransUnion and its competitors have managed consumer complaints.
o Produce a couple of recommendations for how TransUnion can manage consumer complaints better than its competitors.

Tips: The questions above are examples only. You may decide to focus on other questions not mentioned above. Consider using Tableau.
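As a minimal illustration of the kind of quantification Tasks 1 and 2 ask for — complaint volume by company and by location — the sketch below tallies a handful of toy records in pure Python. The field names "company", "state" and "year" are assumptions standing in for whatever the actual CFPB extract calls them; on the real dataset you would do the equivalent group-by in your analysis tool of choice.

```python
from collections import Counter

# Toy complaint records; field names are illustrative stand-ins for the
# real CFPB column names, and the records themselves are invented.
complaints = [
    {"company": "TransUnion", "state": "IL", "year": 2020},
    {"company": "Equifax",    "state": "GA", "year": 2020},
    {"company": "TransUnion", "state": "IL", "year": 2021},
    {"company": "Experian",   "state": "CA", "year": 2021},
    {"company": "TransUnion", "state": "CA", "year": 2021},
]

# Complaint volume by company (a historical trend would group by year too).
by_company = Counter(c["company"] for c in complaints)

# Geographical trend: complaint volume by state for one company.
tu_by_state = Counter(c["state"] for c in complaints
                      if c["company"] == "TransUnion")

print(by_company.most_common())
print(tu_by_state.most_common())
```

The same two-level tally (entity × time, or entity × location) underlies the complaint-volume and complaint-rate charts the tasks call for; a complaint *rate* additionally divides each count by a population or account base.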
Department of Computer Science
COM1000 Group Project (2024-25 Semester 1)
Improving Existing Applications using Contemporary Information Technologies

Items / Tasks and deadlines:
• Presentation materials (e.g., PowerPoint): 16 November, Saturday (23:59). One copy per group; file name: L0X-GP-GroupY.pptx.
• In-class oral presentation: Weeks 12 & 13. To be determined by instructors.
• Final report (in MS Word) plus FIVE IEEE documents [PDF]: 7 December, Saturday (23:59). One copy per group; file name: L0X-GP-GroupY.docx, plus FIVE individual PDF files per group.

Note: L0X and GroupY are your class and group number; e.g., L01-GP-Group9 represents Group 9 in class L01.

Guidelines
1. Each group is required to deliver a presentation (max. 20 mins, excluding Q&A) in Week 12 or 13, and to submit a project report in Week 14 (max. eight A4 pages, font size 12 or above, excluding cover page and references; citations in the text are required). In both the presentation and the report, you should concisely describe the following:
a. Find an existing application, product or service that is not intelligent and not good enough. Identify and discuss its deficiencies under different applicable scenarios.
b. Propose appropriate contemporary information technologies that you have learned from the COM1000 module and discuss how these technologies can be used to resolve those deficiencies for your chosen application, product or service in part (a).
c. Discuss any technical or functional limitations of your upgraded application, product or service in part (b), e.g., why might the application, product or service be difficult to implement or realize?
d. Discuss the impacts of your upgraded application, product or service on end users, e.g.:
i. benefits (e.g., cost/time savings, improved customer satisfaction, etc.)
ii. concerns (e.g., privacy, convenience, service quality and performance, etc.)
Please note that we value quality (how good the report and analysis are) over quantity (how many pages or words you have written, or how many slides you have).

2. Each group is required to use the IEEE Xplore digital library to gather detailed and current information on the topic of your project. During the project preparation, download FIVE PDF documents that are relevant to your topic and submit them to Moodle together with your final report. Please refer to the appendix for guidelines on downloading journal articles from the IEEE Xplore digital library.

Example: The traditional class attendance system at HSUHK, which requires teachers to take student attendance records manually.

Application: Class attendance system to record student attendance.
Deficiencies: (1) It takes a long time to complete the attendance recording, particularly when the number of students in a class is large. (2) Students may dishonestly take attendance for others.
Proposed contemporary information technologies: Deep learning (facial recognition), which recognizes students' faces to take attendance automatically.
Limitation/performance: Poor image quality or wearing masks could affect the performance of facial recognition.
Benefits: Time and cost savings, automation, etc.
Concerns: Breach of privacy, misidentification, etc.

Note
● For the presentation, you may leverage different forms and media of presentation to convince the audience of the superior value of your project. These forms and media may include (but are not limited to) slides, role-play, etc. Your group can decide how best to convey the essence of the project that you and your group members have completed.
● Everyone must participate in the presentation and report writing.
● For the report, good use of figures and illustrations, where appropriate, may help readers understand the contents.
● Your topic MUST NOT be the same as your previous presentations.
● Free riders may receive a lower or zero mark for the group project.

Assessment criteria

Group Presentation (full mark: 40%). Not more than 20 minutes. Weeks 12-13.
• Background and functionality: 8%
• Contemporary technology (description, how it works, and limitation/performance): 16%
• Impact: 8%
• Presentation skill and style; PowerPoint slide design: 4%
• Project originality and feasibility: 4%

Project Report (full mark: 60%). Max. 8 pages. Week 14.
• Background and functionality: 10%
• Contemporary technology (description, how it works, and limitation/performance): 22%
• Impact: 12%
• Project originality and feasibility: 6%
• Others: report format, clarity, readability: 5%
• FIVE IEEE Xplore documents related to your topic: 5%

Projects that are original, insightful, and refreshing, and those that represent good examples of contemporary information technologies, will be rated higher.
IU000145 Assignment brief for Feminist Coding Practices 2024-2025

Task: submit both parts as described below.

Part 1 - A piece of executable computer code (e.g., the code snippet that you contributed to your group's live coding performance) that can be run in a web browser. It should be submitted as a JS file, with comments that explain the code's functionality.

Part 2 - Submit a summary of no more than 600 words, including:
• the title of the group project;
• your own critical reflection on how your understanding of feminist coding practices and of the internet's carbon footprint informed what you did in Part 1;
• concluding thoughts on the future direction of the work and how you see it benefiting a broader societal context.
Assignment Remit
Programme Title: Department of Economics
Module Title: LH Advanced Macroeconomics
Module Code: 07 33109
Assignment Title: Assignment (Main)
Level: LH
Weighting: 50%
Hand Out Date: 06/11/2024
Deadline Date & Time: 19/12/2024, 12 noon
Feedback Post Date: 29/01/2025
Assignment Format: Other
Assignment Length: See below
Submission Format: Online, Individual

This is a 50% assignment in total, split into two equally weighted parts (Questions 1 and 2, i.e., Q1 and Q2). This part is Q1, which is a written assignment. Q2 is a video submission, which goes in a different submission box. The deadline for both parts, Q1 and Q2, is 19th December.

Module Learning Outcomes: This assignment is designed to assess the following module learning outcomes. Your submission will be marked using the Grading Criteria given below.
• Analyse the theoretical models of modern macroeconomics research to discuss the main issues relating to business cycles and long-run growth;
• Analyse issues relating to business cycles and long-run growth both in the UK and in the wider international economy;
• Appraise selected papers from professional journals.

Q1. Consider the two-period Real Business Cycle (RBC) model without uncertainty presented in the lecture slides (also Romer, 2019, ch. 5), but now assume that u(•), for households, takes the given form, where ct is consumption at time t and (1−ℓt) is leisure time at time t. Given that the time endowment is normalised to 1, it follows that ℓt is hours worked at time t. Note that ut contains three parameters: θ>0, b>0 and γ>0. All households in the economy are assumed to be identical; we can therefore consider a 'representative household' (henceforth 'the household'). Set t=1 for the present period and t=2 for the next period. For example, c1 is consumption in the present period and c2 is consumption in the next period. This is a two-period model, so there are no time periods prior to t=1 and none after t=2.
Assume that the household begins and ends life with no accumulated wealth and that the real interest rate is r (where r>0). The intertemporal budget constraint is therefore:

c1 + c2/(1+r) = w1·ℓ1 + w2·ℓ2/(1+r)

Answer the following questions:
a) Present the Lagrangian problem for the household under this model specification. Briefly explain why we need to use the Lagrangian technique. [10%]
b) Derive the first-order conditions for the household in this case. [10%]
c) Use the first-order conditions for ℓ1 and ℓ2 to derive an expression for the relative amount of leisure time chosen by the household over the two periods, i.e., derive an expression for (1−ℓ1)/(1−ℓ2). Explain how an increase in the relative wage (w2/w1) affects the household's decision about how much leisure to take in each period. [10%]
d) Calculate the intertemporal elasticity of substitution between period-1 and period-2 leisure time in this case. [10%]
e) Labour economists typically estimate that the intertemporal elasticity of substitution for leisure is empirically small. Why is this problematic for the RBC model considered here when comparing the predictions of the theory to relevant empirical evidence for the US or the UK? [10%]
f) Aside from the issue considered in part (e), outline two other shortcomings of Real Business Cycle theory as a framework for understanding short-run economic fluctuations (i.e., business cycles) in an economically developed country such as the US or the UK. [50%]

For parts (a) to (e): there is no word limit and referencing is not required. For part (f): you should conduct your own wider reading/research; the word limit for this sub-part is 500 words (+10% tolerance) and you should reference according to the Harvard system.
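The specific functional form of u(•) is not reproduced in this copy of the brief. Purely as an illustrative sketch of the shape a part-(a) answer takes — the utility form below is an assumption, chosen only because it is additively separable and uses the three stated parameters θ>0, b>0, γ>0, and is not necessarily the form actually set in the assignment — one could take

```latex
u_t \;=\; \frac{c_t^{\,1-\theta}}{1-\theta} \;+\; b\,\frac{(1-\ell_t)^{1-\gamma}}{1-\gamma},
\qquad t = 1, 2 .
```

With lifetime utility $U = u_1 + u_2$ and the no-initial-wealth budget constraint $c_1 + c_2/(1+r) = w_1\ell_1 + w_2\ell_2/(1+r)$, the corresponding Lagrangian would read

```latex
\mathcal{L} \;=\; u_1 + u_2 \;+\;
\lambda\!\left( w_1\ell_1 + \frac{w_2\ell_2}{1+r} - c_1 - \frac{c_2}{1+r} \right).
```

The Lagrangian technique is needed because the household maximises utility subject to a binding intertemporal constraint; the multiplier λ can be interpreted as the marginal utility of period-1 wealth.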
ISE529 Predictive Analytics, 2024 Fall
Homework 5
Due by: Nov. 20, 2024, 11:59 PM

Instructions:
1. Print your first and last name and NetID on your answer sheets.
2. Submit all your answers, including Python scripts and report, in a single Jupyter Lab file (.ipynb) or along with a single PDF to Brightspace by the due date. No other file formats will be graded. No late submissions will be accepted.
3. Total 3 problems. Total points: 100.

1. (30 points) Predict the per capita crime rate in the Boston.csv data set. Split the data set into 70% for a training set and 30% for a test set. Fit a lasso model, a ridge regression model, and a PCR model, respectively. Use cross-validation to determine λ and M (the number of PCs). Present the test error and discuss the results for the approaches that you consider.

2. (30 points) Predict the number of applications received using the other variables in the College.csv data set. Split the data set into 60% for a training set and 40% for a test set.
(a) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
(b) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
(c) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

3. (40 points) Use the following code to generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.
(a) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis and X2 on the y-axis.
(b) Fit a logistic regression model to the data using X1, X2, X1², X2², and X1×X2 as predictors. Obtain a class prediction for each training observation (using the full data set). Plot the observations, colored according to the predicted class labels.
(c) Fit an SVM using a non-linear kernel (polynomial with d>1, or an RBF kernel) to the data. Obtain a class prediction for each training observation (using the full data set). Plot the observations, colored according to the predicted class labels.
(d) Comment on your results.
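The generating code referred to in problem 3 is not reproduced in this copy of the homework. As a stand-in, the sketch below shows one common way (the construction used in the ISLR-style exercise this problem resembles) to produce n = 500 points in two classes separated by a quadratic boundary, together with the expanded predictor set needed for part (b). Treat the exact recipe as an assumption, not the code the instructor supplied.

```python
import random

rng = random.Random(1)
n = 500

# Illustrative quadratic-boundary data: points uniform on [-0.5, 0.5]^2,
# labelled by the sign of x1^2 - x2^2 (boundary: x1^2 = x2^2).
x1 = [rng.random() - 0.5 for _ in range(n)]
x2 = [rng.random() - 0.5 for _ in range(n)]
y = [1 if a * a - b * b > 0 else 0 for a, b in zip(x1, x2)]

# Expanded predictor vector for part (b): [X1, X2, X1^2, X2^2, X1*X2].
features = [[a, b, a * a, b * b, a * b] for a, b in zip(x1, x2)]
```

With the quadratic terms included, logistic regression can represent the true boundary x1² = x2² exactly, which is why the predicted labels in part (b) should trace out the quadratic shape, while a logistic fit on X1 and X2 alone could only produce a linear boundary.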
MTH305 Risk Management
Coursework Project Instructions
November 2024

1 General Information
1. The coursework contributes 15% towards your final module grade.
2. In the coursework, you will work with a portfolio of stocks and bonds, and assess its market risk.
3. The coursework covers weeks 1 to 7, with the class demonstration file in week 7 and the labs in weeks 4 and 7 especially relevant to the MATLAB work.
4. You should use MATLAB only, and please use the provided files only.
5. Open the live script file 'MTH305 CW2.mlx' and follow the instructions therein.

2 Submission
1. The submission deadline is 3pm on 22 November 2024 (Friday of week 10).
2. Submit the result file 'results.mat' in the first submission link (do not rename or zip the file).
• Not making the correct submission will incur a penalty of 10 marks.
3. Submit the other files in the second submission link (do not rename or zip files):
• the live script file 'MTH305 CW2.mlx' that contains your work;
• functions you have defined [it is not required that you define functions].
• Not making the correct submission will incur a penalty of 10 marks.
4. Please do not submit the data file 'MTH305 CW2 data.mat', the functions 'import_data.m' and 'VaR_count.m', or MATLAB built-in functions.
5. You can edit your submissions by clicking the button "Edit submission". Editing a submission after the deadline will be treated as a late submission.
6. Late submissions will incur a penalty of 5 marks per working day, up to a maximum of five working days.
• Less than one working day will be counted as one working day.
• Work received more than five working days after the submission deadline will receive a mark of zero.
7. Abnormal similarity of MATLAB code will be subject to investigation under the academic integrity policies and, in the worst case, may result in a mark of zero and the disciplinary process.
8.
Students who believe that their performance may have been impaired by illness or other exceptional circumstances should follow the procedures set out in the University's 'Mitigating Circumstances Policy'.
• Apply for an Extension of Coursework Submission Deadline (e-Bridge - Academic Records - Mitigating Circumstances).
• The application must be made before the original submission deadline.
ECE 427 Assessment 2 - Part II: Take-Home Exam - Transmission Line Design

You're an engineer on a team that is designing a brand-new overhead transmission line from Dubuque, Iowa, to Madison. This would allow more wind power to flow from Iowa to the population centers in southern Wisconsin. The three-phase, 60 Hz, 345 kV line will have a length of 160 km.

Below, you are given a selection of conductors with different names (power line designers like birds) and different sizes, diameters, resistances, and costs. These values are given for a single conductor. In your calculations, you can assume that the conductors are solid cylinders with the given diameters. Also, assume that the conductor temperature stays at 25°C. You can also choose between several bundle configurations and tower configurations, given below.

The design optimization goal is to minimize the following cost function by choosing the conductors and their configuration, and determining the resulting line parameters and the cost. The cost function is defined in terms of:
• CF = cost function to minimize in the design process
• line$ = total cost of conductor material in USD (you can neglect sagging here, assuming a length of 160 km for each conductor you use)
• p_loss^rated = line losses calculated using I²_rated·R, where R is the per-phase resistance derived from the Ω/km values below

Name       Size (kcmil)  Diameter (cm)  Resistance (Ω/km @ 25°C, 60 Hz)  Cost (USD/m)
Chickadee  397.5         1.887          0.14469                          $3.18
Pelican    477           2.068          0.12073                          $3.94
Osprey     556.5         2.233          0.10400                          $4.79
Drake      795           2.814          0.07185                          $7.81
Cardinal   954           3.038          0.06135                          $8.83
Ortolan    1033.5        3.078          0.05774                          $9.32
Bittern    1272          3.416          0.04757                          $10.76
Lapwing    1590          3.820          0.03871                          $14.01
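The cost-function formula itself is not reproduced above, but its two ingredients — total conductor-material cost and rated I²R losses — can be computed directly from the table. The sketch below does this for one candidate design; note that the rated current I_RATED is an assumed placeholder (the exam would supply or have you compute the real value), and the bundling/length conventions are spelled out in the comments.

```python
# Sketch of the two cost components named in the brief, for one candidate
# design.  I_RATED is an assumed placeholder value, not a figure from the exam.
LINE_KM = 160.0   # line length, km (sagging neglected, per the brief)
PHASES = 3
I_RATED = 900.0   # rated current per phase, in A -- assumed for illustration

def line_cost_usd(cost_per_m, n_bundle):
    """Conductor-material cost: one 160 km conductor per phase per bundle position."""
    return cost_per_m * LINE_KM * 1000 * PHASES * n_bundle

def rated_losses_w(r_ohm_per_km, n_bundle, i_rated=I_RATED):
    """Three-phase I^2 R losses at rated current.

    Bundling n identical conductors in parallel divides the per-phase
    resistance of the 160 km line by n.
    """
    r_phase = r_ohm_per_km * LINE_KM / n_bundle
    return PHASES * i_rated ** 2 * r_phase

# Example: "Drake" conductor (0.07185 ohm/km, $7.81/m) in a 2-conductor bundle.
drake_cost = line_cost_usd(7.81, n_bundle=2)      # USD of conductor material
drake_losses = rated_losses_w(0.07185, n_bundle=2)  # W dissipated at rated load
```

Sweeping these two functions over every conductor and bundle option in the table is then just a loop, with the exam's actual CF expression combining the two terms.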
EMATM0061 Summative Assessment
Statistical Computing and Empirical Methods, Teaching Block 1, 2024

Introduction
This document contains the specification for the summative assessment for the unit Statistical Computing and Empirical Methods, TB1 2024. Please read the following instructions carefully before you start answering the questions.

Deadline: Your report is due on 28 November 2024 at 13:00.

Rules: This is an independent task. For the summative assessment you should not share your answers with your colleagues. The experience of solving the problems in this project will prepare you for real problems in your career as a data scientist. If someone asks you for the answer, resist!

Support: Whilst this is an independent task, there is a lot of support available if you need it. If you are unclear about what is required for any part of the assessment, discuss the issue with our teaching team in the computer lab or contact your unit director.

Plagiarism: Be very careful to avoid plagiarism. For more details, consult the "Academic Integrity" section under the Assessment tab within the central Blackboard page for the School of Engineering Mathematics & Technology.

The use of generative AI: The use of generative AI, such as ChatGPT, is prohibited. Any use of generative AI in this assessment will be considered plagiarism.

Extenuating circumstances: For more details on the procedure for extenuating circumstances, consult the "Assessment support options" section under the Assessment tab within the central Blackboard page for the School of Engineering Mathematics & Technology.

Late submission penalty: Coursework that is submitted after a deadline is subject to a late submission penalty, unless there is an extension or a justified exceptional circumstance. For more details, consult the central Blackboard page for the School of Engineering Mathematics & Technology or contact the School office.

Clarity: Clarity is highly important.
Be careful to make sure you clearly explain each step in your answer. You should also include comments within your code where necessary. Your answer should clearly demarcate which part of the question you are answering. Whenever possible, include pieces of well-written code in your report to promote clarity.

Programming language: For Section A of this coursework you should use Tidyverse methods within the R programming language. For Section B and Section C, you can use either R or Python. Regardless of your choice of language, it is essential that your answers are clear and well-written.

Submission points: To submit your solutions, please visit the "Assessment, submission and feedback" tab on the course webpage on Blackboard. Make sure your submission follows the submission structure described below.

Multiple submissions: Submitting the coursework multiple times before the deadline is allowed. However, only the last submission will be considered for marking. You can submit a temporary copy before your final submission if you like.

Submission structure: Please submit a single zip file that contains a folder named "SCEM_???", where "???" should be replaced by your unique UoB username (e.g., lf22553). The folder should contain three subfolders named "A", "B" and "C".
1. Subfolder "A" should include 1) a PDF file that contains your answers to Section A, and 2) a folder containing the code and data used for Section A.
2. Subfolder "B" should include 1) a PDF file that contains your answers to Section B, and 2) a folder containing the code and data used for Section B.
3. Subfolder "C" should include 1) a PDF file that contains your answers to Section C, and 2) a folder containing the code and data used for Section C.

Time allocation: Sections A & B together and Section C each carry 50 marks, but we recommend that you allocate more time to the tasks in Section C, for example 40% on Sections A & B and 60% on Section C.
Section A (20 marks)
General instruction: In this part of your assessment, you will perform a data wrangling task using R programming. Note that clarity is highly important. Be careful to make sure you clearly explain each step in your answer. You should also include comments within your code where necessary. In addition, make the structure of your answer clear through the use of headings. You should also make sure your code is clean by making careful use of Tidyverse methods in R.

(Q1). First download the files entitled "debt_data.csv", "country_data.csv" and "indicator_data.csv", which are available within the Assessment section within Blackboard. The file "debt_data.csv" contains debt data for different countries under different indicators, from 1960 to 2023. The indicators are represented by indicator codes (for example, NY.GNP.MKTP.CD). The file "indicator_data.csv" contains a list of the indicator names as well as their associated indicator codes. The file "country_data.csv" contains information about the country code, income level, and region for each country.

First, load the file "debt_data.csv" into an R data frame called "debt_df", load the file "country_data.csv" into an R data frame called "country_df", and load the file "indicator_data.csv" into a data frame called "indicator_df". Second, use R to check the number of columns and the number of rows that the data frame "debt_df" has. Display your results.

(Q2). Update "debt_df" by reordering its rows such that the values of the indicator "DT.NFL.BLAT.CD" are in descending order. Display a subset of the updated "debt_df" consisting of the first 4 rows and the columns "Country.Code", "Year", "NY.GNP.MKTP.CD", and "DT.NFL.BLAT.CD".

(Q3). In the data frame "debt_df", the indicators are represented by their associated indicator codes rather than by their names. The data frame "indicator_df" contains a list of indicator names and their corresponding indicator codes. Create a new data frame
called "debt_df2" by combining the data from the two data frames "debt_df" and "indicator_df". The new data frame "debt_df2" should be equivalent to "debt_df" except that "debt_df2" now contains indicator names rather than indicator codes. The indicator names in "debt_df2" should match the indicator codes in "debt_df" according to the correspondence described in "indicator_df". Display a subset of "debt_df2" consisting of the first 5 rows and the three columns "Country.Code", "Year", and "Net financial flows, others (NFL, current US$)".

(Q4). The data frame "country_df" contains information about the region, income group, and country name for each country. Create a new data frame called "debt_df3" by combining data from the two data frames "debt_df2" and "country_df". The new data frame "debt_df3" should contain a) all columns from "debt_df2" and b) 3 columns from "country_df" called "Region", "IncomeGroup", and "Country.Name". Make sure that in each row of "debt_df3", the "Region", "IncomeGroup", and "Country.Name" match "Country.Code" according to the correspondence described in "country_df". Your data frames "debt_df3" and "debt_df2" should have the same number of rows, but "debt_df3" has three more columns. Display a subset of "debt_df3" consisting of the first three rows and the 4 columns "Country.Name", "IncomeGroup", "Year", and "Total reserves in months of imports".

(Q5). Rename the following 5 columns from their original names to the new names specified below:

Original column names                                              New column names
Total reserves in months of imports                                Total reserves
External debt stocks, total (DOD, current US$)                     External debt
Net financial flows, bilateral (NFL, current US$)                  Financial flow
Imports of goods, services and primary income (BoP, current US$)   Imports
IFC, private nonguaranteed (NFL, US$)                              IFC

(Q6). Next, generate a summary data frame called "debt_summary" from the data frame "debt_df3" with the following properties: your summary data frame
"debt_summary" should contain 7 rows corresponding to the 7 different regions, and it should also have 5 columns:

"Region" - the names of the 7 different regions, including "East Asia & Pacific", "Europe & Central Asia", etc.
"TR mn" - the average of "Total reserves" in each region.
"ED md" - the median of "External debt" in each region.
"FF quantile" - the 0.2 quantile of "Financial flow" in each region.
"IFC sd" - the standard deviation of "IFC" in each region.

All missing values should be discarded when computing the summary data.

(Q7). Based on your data frame "debt_df3", create a violin plot of "Financial_flow" for each of the regions. The violin plots should be displayed in the same figure, with different colors representing different regions. Ignore all missing values and all values that are smaller than −10^8 or bigger than 10^8. Your plot is expected to look as follows.

(Q8). Based on the data frame "debt_df3", create a plot which displays the "Total_reserves" as a function of the years (from 1960 to 2023) for each of the following countries: Italy, France, United Kingdom, Sudan, Afghanistan, and Brazil. Additionally, the values of "Total_reserves" should be displayed in different panels according to the income groups of the countries. Use different colors to represent different countries. Your plot is expected to look as follows.

Section B (30 marks)

B.1 Suppose a product is being sold in a supermarket. We are interested in knowing how quickly the product returns to the shelf after it is sold out. Let X be a continuous random variable denoting the length of time between the time point at which it is sold out and the time point at which it is placed on the shelf again. So X should be a non-negative number, and X = 0 means that the product gets on the shelf immediately after it is sold out.
Here, we assume that the probability density function of X is given by where b > 0 is a known constant, λ > 0 is a parameter of the distribution, and a is to be determined by λ and b.

(1) First, determine the value of a: derive a mathematical expression of a in terms of λ and/or b.

(2) Derive a formula for the population mean and standard deviation of the random variable X with parameter λ.

(3) Derive a formula for the cumulative distribution function and the quantile function of the random variable X with parameter λ.

(4) Suppose that X_1, ..., X_n are independent copies of X with the unknown parameter λ > 0. What is the maximum likelihood estimate λ_MLE of λ?

Now download the .csv file entitled "supermarket_data_2024" from the Assessment section within Blackboard. The .csv file contains data on the length of time (in seconds) taken by a product to get on the shelf again after being sold out. So the sample is a sequence of time lengths. Let's model the sequence of time lengths in our sample as independent copies of X (the random variable described above) with parameter λ and known constant b = 300 (seconds). Answer the following questions (5) and (6).

(5) Given the sample, compute and display the maximum likelihood estimate λ_MLE of the parameter λ.

(6) Apply the Bootstrap confidence interval method to obtain a confidence interval for λ with a confidence level of 95%. To compute the Bootstrap confidence interval, the number of resamples (i.e., subsamples generated to compute the bootstrap statistics) should be set to 10000.

Next, conduct a simulation study to explore the behaviour of the maximum likelihood estimator:

(7) Conduct a simulation study to explore the behaviour of the maximum likelihood estimator λ_MLE for λ on simulated data X_1, ..., X_n (independent copies of X with parameter λ) according to the following instructions. Let b = 0.01 and the true parameter be λ = 2.
Generate a plot of the mean squared error as a function of the sample size n. You should consider sample sizes from 100 to 5000 in increments of 10. For each sample size, consider 100 trials. In each trial, generate a random sample X_1, ..., X_n (independent copies of X with parameter λ = 2), and then compute the maximum likelihood estimate λ_MLE for λ based upon the sample. Display a plot of the mean squared error of λ_MLE as an estimator for λ as a function of the sample size n.

B.2 Consider a bag of a red balls and b blue balls (the bag has a + b balls in total), where a ≥ 1 and b ≥ 1. We randomly draw two balls from the bag without replacement. That means, we draw the first ball from the bag and, WITHOUT returning the first ball to the bag, we draw the second one. Each ball has an equal chance of being drawn. Now we record the colours of the two balls drawn from the bag, and let X denote the number of red balls minus the number of blue balls. So X is a discrete random variable. For example, if we draw one red ball and one blue ball, then X = 0. Answer the following questions (1) to (11).

(1) Give a formula for the probability mass function p_X : R → [0, 1] of X.

(2) Use the probability mass function p_X to obtain an expression for the expectation E(X) of X (i.e., the population mean) in terms of a and/or b.

(3) Give an expression for the variance Var(X) of X in terms of a and b.

(4) Write a function called compute_expectation_X that takes a and b as inputs and outputs the expectation E(X). Write a function called compute_variance_X that takes a and b as inputs and outputs the variance Var(X). Display your code.

In the following questions, we additionally assume that X_1, X_2, ..., X_n are independent copies of X. So X_1, X_2, ..., X_n are i.i.d. random variables having the same distribution as X. Let X̄ = (1/n) Σ_{i=1}^n X_i be the sample mean.

(5) Give an expression for the expectation of the random variable X̄ in terms of a, b.
(6) Give an expression for the variance of the random variable X̄ in terms of a, b and n.

(7) Create a function called sample_Xs which takes as inputs a, b and n and outputs a sample X_1, X_2, ..., X_n of independent copies of X.

(8) Let a = 3, b = 5 and n = 100000. First, compute the numerical value of E(X) using the function compute_expectation_X, and compute the numerical value of Var(X) using the function compute_variance_X. Second, use the function sample_Xs to generate a sample X_1, X_2, ..., X_n of independent copies of X. With the generated sample, compute the sample mean X̄ and the sample variance. How close is the sample mean X̄ to E(X)? How close is the sample variance to Var(X)? Explain your observations.

Moreover, let µ := E(X̄) and σ := √(Var(X)/n) (the random variables X and X̄ are defined above), and let f_{µ,σ} : R → [0, ∞) be the probability density function of a Gaussian random variable with distribution N(µ, σ²), i.e., the expectation is µ and the variance is σ².

Next, conduct a simulation study to explore the behaviour of the sample mean X̄ by answering questions (9)-(11).

(9) Let a = 3, b = 5 and n = 100. Conduct a simulation study with 50000 trials. In each trial, generate a sample X_1, ..., X_n of independent copies of X. For each of the 50000 trials, compute the corresponding sample mean X̄ based on X_1, ..., X_n.

(10) Create a scatter plot of the points {(x_i, f_{µ,σ}(x_i))}, where {x_i} is a sequence of numbers between µ − 3σ and µ + 3σ in increments of 0.1σ. Then append to the scatter plot a curve representing the kernel density of the sample mean X̄ within your simulation study (with 50000 trials). Use different colours for the points {(x_i, f_{µ,σ}(x_i))} and the density curve of the sample mean X̄.

(11) Describe the relationship between the density of X̄ and the function f_{µ,σ} displayed in your plot. Try to explain the reason.
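To illustrate the kind of code that questions (4) and (7) of B.2 ask for, here is one possible R sketch; it is not a model answer. The helper pmf_X and the numerical (rather than closed-form) computation of E(X) and Var(X) from the probability mass function are choices made for this illustration: your written answers to (2) and (3) should still derive closed-form expressions.

```r
# Sketch under the setup of B.2: X = (#red − #blue) over two draws
# without replacement from a bag of a red and b blue balls.
# The possible values of X are -2, 0, and 2.

# Helper (not required by the question): the pmf of X, computed
# directly from the draw probabilities.
pmf_X <- function(a, b) {
  total <- (a + b) * (a + b - 1)
  c("-2" = b * (b - 1) / total,  # both balls blue
    "0"  = 2 * a * b / total,    # one ball of each colour
    "2"  = a * (a - 1) / total)  # both balls red
}

# E(X) computed numerically from the pmf.
compute_expectation_X <- function(a, b) {
  p <- pmf_X(a, b)
  sum(c(-2, 0, 2) * p)
}

# Var(X) computed numerically from the pmf.
compute_variance_X <- function(a, b) {
  p <- pmf_X(a, b)
  m <- sum(c(-2, 0, 2) * p)
  sum((c(-2, 0, 2) - m)^2 * p)
}

# n independent copies of X, sampled from the pmf.
sample_Xs <- function(a, b, n) {
  sample(c(-2, 0, 2), size = n, replace = TRUE, prob = pmf_X(a, b))
}
```

With a = 3, b = 5 as in question (8), the sample mean of sample_Xs(3, 5, 100000) should land close to compute_expectation_X(3, 5), consistent with the law of large numbers.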
Section C (50 marks)

In this part of the assessment, you are asked to complete a Data Science report which demonstrates your understanding of a statistical method. The goal here is to choose a topic that you find interesting and explore that topic in depth. You are free to choose a topic and data set that interest you. There will be an opportunity to discuss and get advice on your chosen direction in the computer labs. Below are two flexible example structures you can consider for this section of your report. If you are unsure what to do, choose one of the following. Note that you should not submit more than one of the example tasks below.

Example task 1

Investigate a particular hypothesis test, e.g. a Binomial test, a paired Student's t test, an unpaired Student's t test, an F test for ANOVA, a Mann-Whitney U test, a Wilcoxon signed-rank test, a Kruskal-Wallis test, or some other test you find interesting. Note that clarity of presentation is highly important. In addition, you should aim to demonstrate a depth of understanding. For this hypothesis test you are asked to do the following:

1. Give a clear description of the hypothesis test being considered, including the details of the test statistic and p-value, the underlying assumptions, the null hypothesis and the alternative hypothesis. Give an intuitive explanation for why the test statistic is useful in distinguishing between the null and the alternative.

2. Perform a simulation study to investigate the probability of type I error under the null hypothesis for your hypothesis test. Your simulation study should involve randomly generated data which conforms to the null hypothesis. Compare the proportion of rounds in which a type I error is made with the significance level of the test. What happens when a different significance level is used?

3. Choose a suitable real-world data set (for example, some places to find data sets are described below).
Ensure that your chosen data set is appropriate for your chosen hypothesis test. For example, if your chosen hypothesis test is an unpaired t-test, then your chosen data set must have at least one continuous variable and contain at least two groups. It is recommended that your data set for this task not be too large. You should explain the source and the structure of your data set within your report. You should also explain the related problem on which you want to perform the test.

4. Carefully discuss the appropriateness of your statistical test in this setting and how your hypotheses correspond to different aspects of the data set. You may want to use plots to demonstrate the validity of your underlying assumptions. Draw a statistical conclusion and report the value of your test statistic, the p-value and a suitable measure of effect size.

5. Discuss what scientific conclusions you can draw from your hypothesis test. Discuss how these would have differed if the result of your statistical test had differed. Discuss key experimental design considerations necessary for drawing any such scientific conclusion. For example, perhaps an alternative experimental design would have allowed one to draw a conclusion about cause and effect?

6. Explore this hypothesis test further on one topic/direction of your choice. This could be, for example, discussing a property of the test such as how the power of the chosen test changes with sample size, significance level, or effect size. As another example, how robust is the test when its assumptions are violated, and is there a robust alternative? How does the test compare to its non-parametric alternatives? How does the frequentist test compare with its Bayesian alternative? These are just a few examples. Make a clear statement on the question of interest and your conclusions. The details of your approach to support your findings should be visible within your report, and experiments or simulation studies can be included if needed.
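As a rough illustration of the kind of simulation study described in item 2 of Example task 1, here is a minimal R sketch. The choice of test (a one-sample t-test), the sample size, the trial count, and the significance level are all assumptions made for this illustration; your own study should use your chosen test.

```r
# Minimal type-I-error simulation for a one-sample t-test.
# The null hypothesis H0: mu = 0 is true by construction, so every
# rejection is a type I error.
set.seed(1)

num_trials  <- 10000
sample_size <- 30
alpha       <- 0.05  # significance level

# In each trial, generate data conforming to H0 and record the p-value.
p_values <- replicate(num_trials, {
  x <- rnorm(sample_size, mean = 0, sd = 1)  # data generated under H0
  t.test(x, mu = 0)$p.value
})

# The proportion of rejections should be close to alpha.
type_I_error_rate <- mean(p_values < alpha)
type_I_error_rate
```

Re-running the comparison with a different value of alpha (say 0.01 or 0.10) should show the observed rejection rate tracking the chosen significance level, which is the point of the exercise.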
Example task 2

Investigate a particular method for supervised learning. This could be either a method for regression or classification, but it should be a method with at least one tunable hyperparameter. You could choose one from ridge regression, k-nearest neighbour regression, a regression tree, regularized logistic regression, k-nearest neighbour classification, a decision tree, a random forest, or another supervised learning technique you find interesting. Note that clarity of presentation is highly important. In addition, you should aim to demonstrate a depth of understanding.

1. Give a clear description of the supervised learning technique you will use, including the underlying principles and any assumptions. Explain how the training algorithm works and how new predictions are made on test data. Discuss what types of problems this method is appropriate for.

2. Choose a suitable data set where this method can be applied, and perform a train, validation, and test split (for example, some places to find data sets are described below). Be careful to ensure that your data set is appropriate for your chosen algorithm. For example, if you have chosen to investigate a classification algorithm, then your chosen data set must contain at least one categorical variable. Your data set for this task does not need to be large to obtain good results. The size of your data set should not exceed 100MB, and you should aim to use a data set well within this limit. Your report should carefully give the source of your data. In addition, describe your data set: How many features are there? How many examples? What type is each of the variables (e.g. categorical, ordinal, continuous, binary etc.)? You should also explain the associated problem that you will solve using your supervised learning method.

3. What is an appropriate metric for the performance of your model? Give a clear explanation of the metric.
Explore how the performance of your model varies on both the training data and the validation data as you vary the amount of training data used. You should compare the performance of the models across different sizes of the training data.

4. Explore how the performance of your model varies on both the training data and the validation data as you vary a hyperparameter.

5. Choose a hyperparameter and report your performance based on the test data. Can you get a better understanding by using cross-validation?

6. Explore this supervised learning method further on one topic/direction of your choice. This could be, for example, discussing how the bias-variance trade-off impacts the performance of the chosen method. As another example, is your model robust? How does the performance of the method change when applied to imbalanced datasets? Does your method work on small data, and if not, is there a suitable alternative? You could also investigate how different regularisation techniques affect the model's performance, or carefully compare the chosen method with other methods. These are just a few examples. Make a clear statement on the question of interest and your conclusions. The details of your approach to support your findings should be visible within your report, and experiments or simulation studies can be included if needed.

Further instructions for Section C. Note:

1. Do not complete and submit more than one of the above tasks. These are example tasks and you should only choose one. The goal here is to explore a topic in detail.

2. You will be graded on the level of understanding of the key concepts demonstrated within your report. Additional marks will be given for more advanced methods, provided that a very strong level of understanding is displayed. However, you should avoid choosing complex methods without properly demonstrating your understanding. The main focus here is a clear understanding, and you should not sacrifice understanding for the sake of complexity.
A clear understanding of the basic concepts is paramount.

3. You do not need to use large data sets. The data set you choose should not be larger than 100MB. This is an upper bound; you should aim to use a data set well within this limit.

4. We expect your approach to be visible and clear within the report itself. Therefore it is highly recommended to include pieces of clear and well-written code, along with necessary comments and explanations, within the report itself.

5. We expect you to interpret and make sense of the experimental results obtained, instead of displaying a list of results without explanation or analysis. A high-quality report should use the experimental results to support its conclusions and findings in a consistent manner.

6. We do not have a page limit for the report. A rough guideline is that your report should ideally be no more than 10 pages if all figures and large pieces of code were removed. However, this is not a strict constraint. Again, clarity is highly important, and you should include sufficient details to demonstrate your approach and your level of understanding of the key concepts.
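For reference, the train/validation/test split mentioned in item 2 of Example task 2 can be sketched in base R as follows. The 60/20/20 proportions and the use of the built-in iris data set are assumptions made purely for illustration; your own split should suit your chosen data set and method.

```r
# Minimal sketch of a train/validation/test split (60/20/20 here;
# the proportions are an illustrative choice) on the built-in iris data.
set.seed(123)

n <- nrow(iris)
shuffled <- sample(n)  # random permutation of the row indices

train_idx <- shuffled[1:floor(0.6 * n)]
valid_idx <- shuffled[(floor(0.6 * n) + 1):floor(0.8 * n)]
test_idx  <- shuffled[(floor(0.8 * n) + 1):n]

train_df <- iris[train_idx, ]
valid_df <- iris[valid_idx, ]
test_df  <- iris[test_idx, ]

# Every row lands in exactly one of the three sets.
nrow(train_df) + nrow(valid_df) + nrow(test_df) == n
```

Shuffling before splitting matters: it prevents any ordering in the file (for example, rows grouped by class) from leaking into one split and biasing the evaluation.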