People love to ask questions on social media—there are entire sites like Quora and StackOverflow dedicated to asking and answering questions. But what makes for a helpful response? Helpfulness can come in many forms: a story, a code snippet, a favorite book, sound advice, or more. In this homework, we'll look specifically at this notion of a helpful answer. You will design annotation guidelines to rate replies to questions using a 5-point Likert scale from (1) Not Helpful to (5) Very Helpful.

Homework 3 is a two-part group homework focused on teaching you (1) how to design annotation guidelines and measure annotation quality and (2) how to build classifiers based on large language models. You'll go through the full machine learning pipeline from initial unlabeled data to trained classifiers. This data curation process is often dismissed as a simple step, but as a practitioner in the field, if you have to annotate your own data (or hire people to label it for you), you'll need these skills to ensure your eventual classifier performs well.

This homework has the following learning goals:
1. Learn how to iteratively craft annotation guidelines
2. (Optional) Learn how to use annotation tools
3. Learn how to compute rates of inter-annotator agreement
4. Learn how to adjudicate disagreements during data labeling
5. Learn how to use the Hugging Face Trainer infrastructure for training classifiers
6. Improve your skills at reading documentation and examples for advanced NLP methods
7. Learn how to use the Great Lakes cluster and access its GPUs
8. Understand how annotation agreement and guideline divergence influence classification performance

This homework is multi-part because we will use the annotations produced by all teams to train and analyze your classifiers. Roughly, the first two weeks will be spent developing your guidelines and annotating. Once all teams have finished, we will release the larger set of annotations and teams will develop their classifiers. Teams can get started on classifier development earlier using just their own data and then train the classifier on the final data.

As you will see in this assignment, annotation is challenging and annotated data—especially quality annotated data—is incredibly valuable. Based on the number of teams we have, SI 630 will have produced a new and interesting resource for NLP models: a rich set of thousands of responses labeled for their helpfulness and multiple detailed annotation guidelines for different interpretations of helpfulness, along with which guidelines were used for which labels.

As a class, I would like us to publish our collective resource on helpful replies as a technical report on ArXiv with you all as authors (opt-in), where we share the different guidelines and our annotations to contribute to the research community. You would not need to contribute anything beyond your anonymized guidelines and annotations to be an author. Writing and analysis contributions would move you up in the author order. If there's interest, we could even try to publish the technical report at a venue, which has been done for past course projects, e.g., Kenyon-Dean et al. [2018]. You all are doing amazing work, and it's important to realize that you have the ability to advance the state of NLP and data science in general with these skills. The technical report would be a way to document your contribution.

3 Forming a Team

You will design your initial annotation guidelines in teams. Therefore, the very first step is to form a team of 2 to 3 people.
There are no single-person teams or teams of four or more students. Register your team at this link: https://forms.gle/wzjHvSLJc2TBVRdN7. Once registered, you will receive your team's data items to annotate.

Note that this form also asks whether your group would allow its anonymized guidelines and annotations to be included in the class technical report. While totally optional, what you are creating as part of the class has real value and I would encourage you to share your knowledge with the community, for which you'll get authorship credit. If you have any questions, please feel free to reach out to David on Piazza.

■ Problem 1. Register your team using this link https://forms.gle/wzjHvSLJc2TBVRdN7 to receive your specific annotation assignments.

4.1 Designing Annotation Guidelines

Helpful replies can take on many different forms. How should you design your guidelines to label each reply? What makes a comment a 1 versus a 2, or a 4 versus a 5? We've included a large sample of questions and replies from Reddit to get you started thinking about what qualities to look for.

As a first step, each team member should write up their own guidelines for what makes a helpful response. Your guidelines should clearly define how any annotator should judge helpfulness at each point in the 5-point Likert scale. (All teams will use this 5-point scale, so you must adopt it in your own task.) It is often helpful to include examples and descriptions of what is (or is not) an example of each scale point.

Your individual guidelines should be at least one page, single spaced, using one-inch margins. You will likely end up writing more than one page, and writing your thoughts/notes now will be helpful when putting together your group's guidelines. Be sure to draft your guidelines independently without talking to your teammates. These independent thoughts on how to annotate are incredibly useful for framing and will make for a good discussion with your team, since you will have each thought about how to design guidelines and where to make distinctions in helpfulness. You will have your own definition of helpfulness, so writing it down (as guidelines) before meeting with the group will be important for expressing yourself.

■ Problem 2. (15 points) Each team member drafts their own annotation guidelines. Each person should upload their own guidelines to Canvas. Your guidelines should not have any identifying information in them (e.g., your name or uniqname).

4.2 Creating the Group's Annotation Guidelines

Once each team member has drafted their own annotation guidelines, you will meet as a team and collaboratively construct a combined annotation guideline. Try annotating some of the data together with it to see where your scales differ or which examples you would rate differently. Does it cover all the cases you think it should? Use your insight from this process to revise the team's annotation guidelines. We strongly recommend repeatedly randomly sampling a small number of examples, rating them separately using the team's guidelines, and then discussing disagreements to revise the guidelines further.

Your annotation guidelines should provide clear criteria that an annotator could use to determine where any comment goes on the rating scale. You should all agree on the label distinctions. We highly recommend an iterative process where group members meet, independently annotate a few items using the group's guidebook, and then reconvene to discuss any disagreements and how the guidebook could be improved or refined to prevent those in the future.
We recommend providing guidelines that are at least four pages. If you are initially not sure how a guidebook can be that long, do a series of group annotations to surface disagreements and then write clear criteria/qualities and examples to help annotators avoid those disagreements. Documentation is key!

■ Problem 3. (15 points) Draft the annotation guide that will be used as a group. Upload the group's guidelines to Canvas. Because we will redistribute these documents, your guidelines should not have any identifying information in them (e.g., your names or uniqnames). Specify the group members via a Canvas group.

4.3 Annotating Your Data

Once your team has decided on a set of annotation guidelines, each team member will use the guidelines individually to rate your assigned items. Do your best to follow the guidelines, and feel free to make notes on the annotation process (or guidelines) during annotation. You'll use these notes later when you recommend updates.

You are welcome to use any annotation software you want. Perhaps the simplest option is to load the data into a Google Sheet and create a new column called "Rating", which will make it easier to aggregate the ratings later. You can then work through each item and easily download the annotated data as a .csv for use later in training. If you want to get more sophisticated, you can also turn on Data Validation (Data -> Data Validation in the menu) and provide a list of options to annotate with. Another option is to use a fancier annotation tool like Potato (https://github.com/davidjurgens/potato) or any of the others mentioned in Week 7 in class (e.g., LightTag, Label Studio, Prodigy, etc.). In general, you should find a tool and workflow that maximizes annotation throughput while helping you prevent mistakes.

■ Problem 4. (15 points) Individually label the data using the group's guidelines. You are required to follow the Academic Honesty code and not discuss your annotations with each other.

■ Problem 5. Once finished annotating, collect your group's annotations into a single .csv file in a format like the following:

id,annotator,rating
s987u2a,user1,3
s987u2a,user2,2
s987u2a,user3,3
g902a0f,user1,1
g902a0f,user2,1
g902a0f,user3,2

and upload your file to Canvas. Your file should have no additional columns and should load easily using pd.read_csv("yourfile.csv"). Failure to do this will result in lost points. We will eventually distribute these files and anonymize the usernames, so you do not need to anonymize them yourselves.

4.4 Measuring Inter-Annotator Agreement

Once you have finished your annotation, it's time to measure quality! You will measure annotation agreement in two ways: (1) Pearson's correlation r and (2) Krippendorff's α. For the latter, you should use a Python package; we recommend either https://github.com/pln-fing-udelar/fast-krippendorff or https://github.com/grrrr/krippendorff-alpha.

■ Problem 6. (5 points) Perform the following agreement analyses:
1. Compute r and compute α using the ordinal and nominal levels of measurement for the group members' annotations and report them (three scores total).
2. In 2-3 sentences, comment on the difference (if any) between your group's r and α scores. Which is higher and what do you think this means?
3. In 2-3 sentences, comment on the difference (if any) between your group's ordinal and nominal α scores. Which is higher and what do you think this means? Which one should you use in practice to measure agreement in this setting?
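To make the computation concrete, here is a minimal sketch (not a required solution) of computing these agreement scores. It assumes your annotations are in the Problem 5 csv format and that you installed the krippendorff package from the pln-fing-udelar/fast-krippendorff repository (pip install krippendorff):

import itertools
import numpy as np
import pandas as pd
import krippendorff
from scipy.stats import pearsonr

df = pd.read_csv("yourfile.csv")                       # id, annotator, rating
wide = df.pivot(index="annotator", columns="id", values="rating")

# Pearson's r: average the pairwise correlations between annotators
rs = []
for a1, a2 in itertools.combinations(wide.index, 2):
    pair = wide.loc[[a1, a2]].dropna(axis=1)           # items both annotators rated
    r, _ = pearsonr(pair.loc[a1], pair.loc[a2])
    rs.append(r)
print("mean pairwise Pearson r:", np.mean(rs))

# Krippendorff's alpha: rows are annotators, columns are items (NaN = missing)
reliability = wide.to_numpy(dtype=float)
for level in ("ordinal", "nominal"):
    alpha = krippendorff.alpha(reliability_data=reliability,
                               level_of_measurement=level)
    print(f"Krippendorff's alpha ({level}):", alpha)

Averaging the pairwise Pearson correlations is just one reasonable way to summarize r across three annotators; whichever convention you use, state it in your report.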
■ Problem 7. (5 points) Once all the annotations are released, compute the agreement of all the other annotators on your group's items. In 2-3 sentences, comment on the difference (if any) between your group's and the other students' agreement and what you think is the cause. We recommend looking at other groups' guidelines at this point.

4.5 Examining Disagreements

Once all of the data is released, we'll look for places where your group disagreed with other students' decisions. Conveniently, each item will have been annotated by two groups, each with their own guidelines. What led to the disagreements? Was it differences in guidelines? Differences in annotators' interpretations? Fatigue? Lexical ambiguity? There are many reasons, and this part of the assignment will have you grappling with the challenge of defining truth and understanding why some annotators might disagree.

For each item, compute the mean score of your group's ratings and the mean score of the other group of students. We'll treat the mean as the group's estimate, recognizing that there are other ways to choose a final label. Examine the 10 instances that have the biggest absolute difference in mean rating between your group and the other group. Often this type of review process is helpful for identifying gaps in your guidelines. (A sketch of this computation is shown after Problem 9.)

■ Problem 8. (5 points) In your report, include the text from each of these 10 replies and what the ratings were for each group. Describe why you think the two were different, ideally pointing to differences in the two groups' guidelines. After a discussion with your group members, adjudicate the rating and decide what the final "true" rating is, examining all the evidence. If you changed your score, state in one sentence what led you to do this.

■ Problem 9. (5 points) Based on your analysis of annotator agreement, the disagreements, and the different annotation guidelines, provide at least five specific improvements that could be made to your own guidelines to improve annotator agreement. Improvements should be a series of bullet points, each describing in at least 1-2 sentences the changes that would be made to increase annotator agreement or correct for gaps and edge cases.
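As a starting point for the disagreement analysis described before Problem 8, here is a hedged pandas sketch; the filename and the my_group/other_group username lists are placeholders for your own data:

import pandas as pd

df = pd.read_csv("all_annotations.csv")            # id, annotator, rating
my_group = ["user1", "user2", "user3"]             # your team's annotators
other_group = ["user7", "user8"]                   # the other team on your items

mine = df[df.annotator.isin(my_group)].groupby("id").rating.mean()
theirs = df[df.annotator.isin(other_group)].groupby("id").rating.mean()

diff = (mine - theirs).abs().dropna().sort_values(ascending=False)
print(diff.head(10))                               # the items to discuss in Problem 8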
5 Part 2: Building Your Classifier

Part 2 will have you developing classifiers using the very powerful Hugging Face library (https://huggingface.co/). Modern NLP has been driven by advancements in Large Language Models (LLMs), which, as we discussed in Week 3, learn to predict word sequences. Substantial amounts of research have shown that once these models learn to "recognize language," their parameters (the weights in the neural network) can quickly be adapted to accomplish many NLP tasks. We'll revisit the task from Homework 1: classification.

Most of the work on LLMs has focused on BERT or RoBERTa, which are large masked language models. These models have been shown to generalize well to a variety of new tasks. However, the models have hundreds of millions of parameters, which can make them slow to train. As a result, another branch of work has focused on making these language models smaller: can we distill a smaller model from what the larger model has learned, or start with a smaller model and use different techniques to have it learn the same thing? The "BERT family" of models all have the same structure of inputs and outputs (and core neural architecture: the Transformer network) but vary in how they're trained and how many neurons they use.

Thankfully, in the Hugging Face API, all of these models can be accessed in the same way; the end user (you) just needs to specify the model family and the parameters to plug in. For this assignment we will use a much smaller but nearly-as-performant version of BERT, https://huggingface.co/microsoft/MiniLM-L12-H384-uncased, to train our models.

This part of the assignment will have you using the Great Lakes cluster, which has provided free GPUs for us to use (https://arc.umich.edu/greatlakes/user-guide/). You will need to use Great Lakes for this assignment to be able to train the deep learning models. We have made it an explicit learning goal as well, so that you are comfortable with the system for Homework 4 and for using the cluster in your course projects.

To train your classifier, you have a few options. The simplest is to treat each scale point as a class and do multi-class classification. However, as you might notice, many of the ratings are not whole numbers, so you would need to round, which loses some information. Further, if an answer is scored a 5 and the model predicts a 4, that prediction is just as wrong as a prediction of 1 in the classification setting. Therefore, a slightly better-aligned model would be a regression model. Regression models frequently use a different loss, commonly Mean Squared Error, unlike the Cross-Entropy Loss (or Binary Cross-Entropy Loss) we used in previous homework. There are many other options too, depending on how exotic you want to get. You are free to use whatever model and loss you want and we'll see how well they perform.

■ Problem 10. If you don't have an account on Great Lakes, request one now at https://arc.umich.edu/login-request.

For training your LLM-based classifier, we'll use the Trainer class from Hugging Face (https://huggingface.co/docs/transformers/main_classes/trainer), which wraps much of the core training loop into a single function call and supports many (many) extensions and optimizations, like using adaptive learning rates or stopping training early once your model converges to avoid overtraining. The class's documentation has many options, which can be a bit overwhelming at first. We recommend searching out examples and blog posts on how to use the class; finding and interpreting these kinds of informal documentation is a learning goal for this assignment. As you progress in your technical career, honing your skills at quickly finding and making use of these resources (much as you likely already do with StackOverflow) will be critical for using state-of-the-art NLP models.

■ Problem 11. (15 points) Develop your code using Hugging Face's Trainer class to train a classifier or regressor to predict the helpfulness rating of an answer. Submit your code to Canvas and report your score on the development dataset in your team's report.

For this homework, we'll be trying the most stringent form of evaluation. You are welcome and encouraged to do extensive testing of your model on the development data. However, you will get only one shot at predicting the test data. The motivation for this setup is that it is the most rigorous test of how well your model generalizes to unseen data.

■ Problem 12. (10 points) Use your trained Hugging Face model to predict the helpfulness score on the test dataset and submit it to Kaggle. Your grade will depend, in part, on how well your model performs. You only get one submission to Kaggle, so make it count!
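To help you get started on Problems 11 and 12, here is a minimal, hedged sketch of a regression setup with the Trainer class. The train.csv/dev.csv files and their text/label columns are hypothetical stand-ins for however you organize the released data, and the hyperparameters are only starting points:

import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_name = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 turns the classification head into a single regression output
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

def load_split(path):
    df = pd.read_csv(path)
    df["label"] = df["label"].astype(float)            # float labels -> regression
    return Dataset.from_pandas(df).map(tokenize, batched=True)

train_ds, dev_ds = load_split("train.csv"), load_split("dev.csv")

args = TrainingArguments(output_dir="helpfulness-model",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5,
                         evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=train_ds, eval_dataset=dev_ds)
trainer.train()

preds = trainer.predict(dev_ds).predictions.squeeze()   # predicted helpfulness scores

With num_labels=1 the model is trained with a mean squared error loss, which matches the regression framing discussed above; a multi-class setup would instead round the ratings and set num_labels to the number of scale points.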
5.1 Annotation Evaluation via Classification

Once you have the code for training the model ready, we can also put it to use to examine how annotators and guidelines might have influenced model performance. Specifically, you'll perform annotator ablation tests where we hold out the annotations of some groups of annotators and see (1) how the model performance changes and (2) whether, on the validation set, the model's predictions are more or less similar to those annotators' labels.

You'll do the following procedure for each group gi ∈ G:
• Remove all the annotations by any annotator in gi from the training data.
• Create ground truth data by averaging the scores of all remaining annotators in the training data.
• Train your helpfulness prediction model on these aggregated scores.
• Split the development data into three sets: (a) replies without any annotators in gi, (b) replies to items with annotators in gi, using their annotations, and (c) the same replies as (b) but using the annotations by the individuals not in gi. The first set measures how well the model is doing in general. The difference between (b) and (c) tells you whether the model is better able to predict the scores of group gi (even though it wasn't trained on their data) or whether its predictions are closer to everyone else's ratings.
• For each of the three development sets, take the average score of the annotators as the ground truth.
• Estimate your model's performance using Pearson's correlation on each of the validation subsets.

A sketch of this evaluation appears after Problem 13.

■ Problem 13. (10 points) Make a bar plot of the correlation scores of each group for the three evaluation subsets in the format shown in Figure 1. In a paragraph, describe what you see and discuss the relative impact of each team's annotations. For example, did some groups' annotations cause the model to perform worse? Were some groups' labels closer to the model's predictions? In your discussion, review the relevant groups' guidelines and describe whether specific instructions in their guidelines might be the cause of these differences in scores/performance.

[Figure 1: An example of the format of the figure comparing classifier performance when a specific group's annotations are held out and evaluated, as described in Section 5.1 (bars show the correlation for subsets a, b, and c for each group). The example figure is shown for two groups; your figure should show all groups. If space is too tight, you can split the groups into separate figures with multiple groups per subfigure.]
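Here is a hedged sketch of the correlation computation and a Figure 1-style bar plot for Problem 13. The results dictionary is a placeholder filled with random numbers only so the sketch runs; replace it with your model's predictions and the gold mean ratings for subsets (a), (b), and (c):

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import pearsonr

# Placeholder structure only: results[group][subset] = (predictions, gold_mean_ratings)
rng = np.random.default_rng(0)
results = {g: {s: (rng.uniform(1, 5, 50), rng.uniform(1, 5, 50))
               for s in "abc"} for g in ["Group1", "Group2"]}

correlations = {g: {s: pearsonr(p, y)[0] for s, (p, y) in subsets.items()}
                for g, subsets in results.items()}

groups = sorted(correlations)
x = np.arange(len(groups))
width = 0.25
for i, subset in enumerate("abc"):
    plt.bar(x + i * width, [correlations[g][subset] for g in groups],
            width=width, label=f"Subset {subset}")
plt.xticks(x + width, groups)
plt.ylabel("Correlation")
plt.legend()
plt.savefig("ablation_correlations.png")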
6 What to submit

All group-based submissions should be done by forming a group in the Canvas assignment, as you do for the lecture notes.
• Each person submits their own guidelines to Canvas.
• One person submits the group's annotation guidelines to Canvas.
• One person submits the group's annotations.
• One person submits the group's implementation of the helpfulness prediction model.
• One person submits a single PDF document with the group's answers and plots for all questions in this document.
• One person submits their group's predictions for the test set on Kaggle (NB: you only get one upload!).

References

Kian Kenyon-Dean, Eisha Ahmed, Scott Fujimoto, Jeremy Georges-Filteau, Christopher Glasz, Barleen Kaur, Auguste Lalande, Shruti Bhanderi, Robert Belfer, Nirmal Kanagasabai, Roman Sarrazin-Gendron, Rohit Verma, and Derek Ruths. Sentiment analysis: It's complicated! In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1886-1895, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1171. URL https://aclanthology.org/N18-1171.
How do we represent word meaning so that we can analyze it, compare different words' meanings, and use these representations in NLP tasks? One way to learn word meaning is to find regularities in how a word is used. Two words that appear in very similar contexts probably mean similar things. One way you could capture these contexts is to simply count which words appeared nearby. If we had a vocabulary of V words, we would end up with each word being represented as a vector of length |V|, where for a word w_i, each dimension j of w_i's vector, w_{i,j}, refers to how many times w_j appeared in a context where w_i was used. (You'll often see the number of words in a vocabulary abbreviated as V in blog posts and papers. This notation is shorthand; typically V is somehow related to the vocabulary, so use your judgment on whether it refers to the set of words or to the total number of words.)

The simple counting model we described actually works pretty well as a baseline! However, it has two major drawbacks. First, if we have a lot of text and a big vocabulary, our word vector representations become very expensive to compute and store. A vocabulary of 1,000 words that all co-occur with some frequency would take a matrix of size |V|^2, which has a million elements! Even though not all words will co-occur in practice, when we have hundreds of thousands of words, the matrix can become infeasible to compute.

Second, this count-based representation has a lot of redundancy in it. If "ocean" and "sea" appear in similar contexts, we probably don't need the co-occurrence counts for all |V| words to tell us they are synonyms. In mathematical terms, we're trying to find a lower-rank matrix that doesn't need all |V| dimensions.

Word embeddings solve both of these problems by trying to encode the kinds of contexts a word appears in as a low-dimensional vector. There are many (many) solutions for how to find lower-dimensional representations, with some of the earliest and most successful ones being based on the Singular Value Decomposition (SVD); one you may have heard of is Latent Semantic Analysis. In Homework 2, you'll learn about a relatively recent technique, word2vec, that outperforms prior approaches for a wide variety of NLP tasks and is very widely used. This homework will build on your experience with stochastic gradient descent (SGD) and log-likelihood (LL) from Homework 1. You'll (1) implement a basic version of word2vec that will learn word representations and then (2) try using those representations in intrinsic tasks that measure word similarity and an extrinsic task for sentiment analysis.

For this homework, we've provided skeleton code in Python 3 that you can use to finish the implementation of word2vec, with comments within to help hint at how to turn some of the math into Python code. You'll want to start early on this homework so you can familiarize yourself with the code and implement each part.

This homework has the following learning goals:
• Develop your PyTorch programming skills by working with more of the library
• Learn how word2vec works in practice
• Learn how to modify the loss to have networks learn (or avoid learning) multiple things
• Improve your advanced data science debugging skills
• Gain experience working with large corpora
• Identify bias in pretrained models and attempt to mitigate it
• Learn how to use tensorboard

This homework is a mix of conceptual and skills-based learning.
As you get the hang of programming neural networks, you'll be able to teach them to do many more advanced tasks. This homework will hopefully help prepare you by again having you advance your skills while also getting you thinking about what training word embeddings can do for us (as practitioners).

2 Notes

We've made the implementation easy to follow and avoided some of the useful-but-opaque optimizations that can make the code much faster. (You'll also find a lot of wrong implementations of word2vec online, if you go looking. Those implementations will be much slower and produce worse vectors. Beware!) As a result, training your model may take some time. We estimate that on a regular laptop, it might take 30-45 minutes to finish training a single epoch of your model. That said, you can still quickly run the model for ∼10K steps in a few minutes and check whether it's working. A good way to check is to see what words are most similar to some high-frequency words, e.g., "january" or "good." If the model is working, similar-meaning words should have similar vector representations, which will be reflected in the most-similar word lists. We have included this as an automated test which will print out the most similar words.

The skeleton code also includes methods for writing word2vec data in a common format readable by the Gensim library. This means you can save your model and load the data with any other common libraries that work with word2vec. Once you're able to run your model for ∼100K iterations (or more), we recommend saving a copy of its vectors and loading them in a notebook to test. We've included an exploratory notebook.

On a final note, this is the most challenging homework in the class. Much of your time will be spent on Task 1, which is just implementing word2vec. It's a hard but incredibly rewarding homework, and the process of doing it will help turn you into a world-class information and data scientist!

3 Data

For data, we'll be using a sample of cleaned Wikipedia biographies that's been shrunk down to make it manageable. This is pretty fun data to use since it lets us use word vectors to probe for knowledge (e.g., what's similar to chemistry?). If you're very ambitious, we've included the full cleaned biographies, which will be slow. Feel free to see how the model works and whether you can get through a single epoch! We've provided several files for you to use:

1. wiki-bios.med.txt – Eventually train your word2vec model on this data.
2. wiki-bios.DEBUG.txt – The first 100 biographies. Your model won't learn much from this, but you can use the file to quickly test and debug your code without having to wait for the tokenization to finish.
3. wiki-bios.10k.txt – A smaller sample of 10K biographies that you can use to verify your word2vec is learning something without having to train on the final corpus.
4. wiki-bios.HUGE.txt – A much larger dataset of biographies. You should only use this if you want to test scalability or see how much you can optimize your method.
5. word-pair-similarity-predictions.csv – The data for the debiasing evaluation that you'll use to estimate similarities for CodaLab. You only need this for the very last part of the homework.

4 Task 1: Word2vec

In Task 1, you'll implement parts of word2vec in various stages. Word2vec itself is a complex piece of software and you won't be implementing all of its features in this homework. In particular, you will implement:

1. Skip-gram with negative sampling (you might see this abbreviated as SGNS)
2. Rare word removal
3. Frequent word subsampling
You'll spend the majority of your time on Part 1 of that list, which involves writing the gradient descent code. You'll start by getting the core part of the algorithm up and running with gradient descent, without Parts 2 and 3, using negative sampling to generate examples of incorrect outputs. Then, you'll work on ways to improve efficiency and quality by removing overly common words and removing rare words.

Parameters and notation. The vocabulary size is V, and the hidden layer size is k. The hidden layer size k is a hyperparameter that will determine the size of our embeddings. The units on these adjacent layers are fully connected. The input is a one-hot encoded vector x, which means that for a given input context word, only one of the V units {x_1, ..., x_V} will be 1, and all other units are 0.

The output layer consists of a number of context words which are also V-dimensional one-hot encodings of a number of words before and after the input word in the sequence. So if your input word was word w in a sequence of text and you have a context window of ±2, this means you will have four V-dimensional one-hot outputs in your output layer, each encoding words w_{-2}, w_{-1}, w_{+1}, w_{+2} respectively. (Typically, when describing a window around a word, we use negative indices to refer to words before the target, so a ±2 window around index i starts at i−2 and ends at i+2 but excludes index i itself.) Unlike the input-to-hidden-layer weights, the hidden-to-output-layer weights are shared: the weight matrix that connects the hidden layer to output word w_j is the same one that connects to output word w_k for all context words.

The weights between the input layer and the hidden layer can be represented by a V × k matrix W, and the weights between the hidden layer and each of the output contexts are similarly represented as a matrix C with the same dimensions. Each row of W is the k-dimensional embedded representation v_I of the associated word w_I of the input layer—these rows are effectively the word embeddings we want to produce with word2vec. Let input word w_I have one-hot encoding x and let h be the output produced at the hidden layer. Then we have:

h = W^T x = v_I    (1)

Similarly, v_I acts as an input to the second weight matrix C to produce the output neurons, which will be the same for all context words in the context window. That is, each output word vector is:

u = C h    (2)

and for a specific word w_j, we have the corresponding embedding in C as v'_j, and the corresponding neuron in the output layer gets u_j as its input, where:

u_j = v'_j^T h    (3)

Note that in both of these cases, multiplying the one-hot vector for a word w_i by the corresponding matrix is the same thing as simply selecting the row of the matrix corresponding to the embedding for w_i. If it helps to think about this visually, think about the case for the inputs to the network: the one-hot encoding represents which word is the center word, with all other words not being present. As a result, their inputs are zero and never contribute to the activation of the hidden layer (only the center word does!), so we don't even need to do the multiplication.
In practice, we typically never represent these one-hot vectors for word2vec, as it's much more efficient to simply select the appropriate row.

An unoptimized, naive version of word2vec would predict which context word w_c was present given an input word w_I by estimating the probabilities across the whole vocabulary using the softmax function:

P(w_c = w*_c | w_I) = y_c = exp(u_c) / Σ_{i=1}^{V} exp(u_i)    (4)

The original log-likelihood objective is then to maximize the probability that the context words (in this case, w_{-2}, ..., w_{+2}) were all guessed correctly given the input word w_I. Note that you are not implementing this function! Showing it raises two important questions: (1) why is it still being described, and (2) why aren't you implementing it? First, the equation represents an ideal case of what the model should be doing: given some positive value to predict for one of the outputs (w_c), everything else should be close to zero. This objective is similar to the likelihood you implemented for Logistic Regression: given some input, the weights need to be moved to push the predictions closer to 0 or closer to 1. However, think about how many weights you'd need to update to optimize this particular log-likelihood: for each positive prediction, you'd need to update the |V| − 1 other vectors to make their predictions closer to 0. That strategy, which uses the softmax, results in a huge computational overhead—despite being the most conceptually sound. The success of word2vec is, in part, due to coming up with a smart way to achieve nearly the same result without having to apply the softmax. Therefore, to answer the second question, now that you know what the goal is, you'll be implementing a far more efficient method known as negative sampling that approximately optimizes this same objective!

If you read the original word2vec paper, you might find some of the notation hard to follow. Thankfully, several papers have tried to unpack the paper in a more accessible format. If you want another description of how the algorithm works, try reading Goldberg and Levy [2014] (https://arxiv.org/pdf/1402.3722.pdf) or Rong [2014] (https://arxiv.org/pdf/1411.2738.pdf) for more explanation. There are also plenty of good blog tutorials for how word2vec works and you're welcome to consult those (e.g., http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), as well as some online demos that show how things work (https://ronxin.github.io/wevi/). There's also a very nice illustrated guide to word2vec, https://jalammar.github.io/illustrated-word2vec/, that can provide more intuition.

4.1 Getting Started: Preparing the Corpus

Before we can even start training, we'll need to determine the vocabulary of the input text and then convert the text into a sequence of IDs that reflect which input neuron corresponds to which word. Word2vec typically treats all text as one long sequence, which ignores sentence boundaries, document boundaries, and other otherwise-useful markers of discourse. We will follow suit. In the code, you'll see general instructions on which steps are needed to (1) create a mapping of word to ID and (2) process the input sequence of tokens and convert it to a sequence of IDs that we can use for training. This sequence of IDs is what we'll use to create our training data. As part of this process, we'll also keep track of all the token frequencies in our vocabulary.

■ Problem 1. Modify the function load_data in the Corpus class to read in the text data and fill in the id_to_word, word_to_id, and full_token_sequence_as_ids fields. You can safely skip the rare word removal and subsampling for now.
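For orientation, here is an illustrative sketch of the bookkeeping Problem 1 asks for; the field names come from the problem statement, but the provided skeleton code almost certainly structures this differently (and will include the rare-word and subsampling logic you add later):

from collections import Counter

class Corpus:
    def load_data(self, filename):
        with open(filename, encoding="utf-8") as f:
            tokens = f.read().lower().split()       # your tokenizer may differ

        self.word_counts = Counter(tokens)          # frequencies, needed later for sampling
        self.word_to_id = {w: i for i, w in enumerate(self.word_counts)}
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}
        self.full_token_sequence_as_ids = [self.word_to_id[w] for w in tokens]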
4.2 Negative sampling

For a target word, the nearby words in the context form the positive examples for training our prediction model. Rather than train word2vec like a regular multiclass classification model (which uses the softmax function to predict outputs), word2vec uses a small number of randomly selected words as negative examples. These negative examples are referred to as the negative samples. (When using the softmax, you would update the parameters (embeddings) for all the words after seeing each training instance. Consider how many parameters we would have to adjust: for one prediction, we would need to change |V|·N weights—this is expensive to do! Mikolov et al. proposed a slightly different update rule to speed things up: instead of updating all the weights, we update only a small percentage by updating the weights for the words actually in the context and then performing negative sampling to choose a few words at random as negative examples—words that shouldn't be predicted to be in the context—and updating the weights for these negative predictions. There is another formulation of word2vec that uses a hierarchical softmax to speed up the softmax computation, which is the bottleneck, but few use it in practice.)

The negative samples are chosen using a unigram distribution raised to the 3/4 power: each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of weights for all words. The decision to raise the frequency to the 3/4 power is fairly empirical; this function was reported in the original paper to outperform other ways of biasing the negative sampling towards infrequent words. Computing this function each time we sample a negative example is expensive, so one important implementation efficiency is to create a table that we can quickly sample words from. We've provided some notes in the code and your job will be to fill in a table that can be efficiently sampled. (Hint: In the slides, we showed how to sample from a multinomial—e.g., a die with different weights per side—by turning it into a distribution that can be sampled by choosing a random number in [0,1]. You'll be doing something similar here.)

■ Problem 2. Modify the function generate_negative_sampling_table to create the negative sampling table.
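One common way to build such a table, sketched here under the assumption that you track word counts as in the earlier sketch, is to allocate slots in a large array in proportion to each word's weight; sampling a negative example is then just indexing with a random integer:

import numpy as np

def generate_negative_sampling_table(word_counts, word_to_id, table_size=1_000_000):
    words = list(word_counts)
    weights = np.array([word_counts[w] for w in words], dtype=np.float64) ** 0.75
    probs = weights / weights.sum()

    # Give each word a share of the table's slots proportional to its probability
    counts_per_word = np.round(probs * table_size).astype(int)
    table = np.repeat([word_to_id[w] for w in words], counts_per_word)
    return table

# Usage: negative_id = table[np.random.randint(len(table))]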
4.3 Generating the Training Data

Once you have the tokens in place, the next step is to get the training data in place to actually train the model. Say we have the input word "fox" and observed context word "quick". When training the network on the word pair ("fox", "quick"), we want the model to predict an output of 1, signalling that this word ("quick") was present in the context. With negative sampling, we will randomly select a small number of negative examples (let's say 2) for each positive example and update their weights as well. (In this context, a negative example is one for which we want the network to output a 0.) When updating the model (later), our parameters will be updated based on our current ability to predict 1 for the positive examples and 0 for the negative examples.

To generate the training data, you'll iterate through all token IDs in the sequence. At each time step, the current token ID will become the target word. You'll use the window size parameter to decide how many nearby tokens should be included as positive training examples. The original word2vec paper says that selecting 5-20 negative words works well for smaller datasets, and you can get away with only 2-5 words for large datasets. In this assignment, you will use 2 negative words per context word. This means that if your context window selects four words, you will randomly sample 8 words as negative examples of context words. We recommend keeping the negative sampling rate at 2, but you're welcome to try changing it and seeing its effect (we recommend doing this after you've completed the main assignment).

Note: There is one important PyTorch-related wrinkle that you will need to account for, which is described in detail in the code.

■ Problem 3. Generate the list of training instances according to the specifications in the code.

4.4 Define Your word2vec Network

Now that the data is ready, we can define our PyTorch neural network for word2vec. Here, we will not use layers but instead use PyTorch's Embedding class to keep track of our target word and context word embeddings.

■ Problem 4. Modify the init_weights function to initialize the values in the two Embedding objects based on the size of the vocabulary |V| and the size of the embeddings. Unlike in logistic regression, where we initialized our β vector to zeros, here we'll initialize the weights to have small non-zero values centered on zero and sampled from (-init_range, init_range). (Why initialize this way? Consider what would happen if our initial matrices were all zero and we had to compute the inner product of the word and context vectors. The value would always be zero and the model would never be able to learn anything!)

The next step is to update the forward function, which takes as input some target word and context words and predicts 0 or 1 for whether each context word was present. Formally, for some target word vector v_t and context word vector v_c, word2vec makes its predictions as

σ(v_t · v_c)    (5)

where σ is the sigmoid function (as in Homework 1). Word2vec aims to learn parameters (its two embedding matrices) such that this function is maximized for positive examples and minimized for negative examples.

■ Problem 5. Modify the forward function accordingly.
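Here is a hedged sketch of what such a network might look like; the constructor arguments and input shapes are assumptions, and your skeleton code (and the PyTorch wrinkle mentioned above) may require a different arrangement:

import torch
import torch.nn as nn

class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_size, init_range=0.1):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_size)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_size)
        # Small random values centered on zero (Problem 4)
        nn.init.uniform_(self.target_embeddings.weight, -init_range, init_range)
        nn.init.uniform_(self.context_embeddings.weight, -init_range, init_range)

    def forward(self, target_ids, context_ids):
        # target_ids: (batch,); context_ids: (batch, num_context_words)
        v_t = self.target_embeddings(target_ids)            # (batch, k)
        v_c = self.context_embeddings(context_ids)          # (batch, n, k)
        dots = torch.bmm(v_c, v_t.unsqueeze(2)).squeeze(2)  # (batch, n) dot products
        return torch.sigmoid(dots)                          # P(each word is in the context)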
4.5 Train Your Model

Once you have the data in the right format, you're ready to train your model! You will need to implement the core training loop like you did in Homework 1, where you iterate over all the instances in a single epoch and potentially train for multiple epochs. One key difference this time is that you will use batching. In Homework 1 we had a stark contrast between (1) full gradient descent, where a single step required us to compute the gradient with respect to all the data, and (2) stochastic gradient descent, where we take a step based on the prediction error for a single instance. However, there is a middle ground! Often we can improve the gradient by computing it with respect to a few instances instead of just one. As an analogy, if you wanted to know whether you were on the right track, it can help to ask a few folks, but you don't need to ask everyone (and asking just one person could be risky and send you in the wrong direction).

Batched gradient descent works the same way. Conveniently, PyTorch works nearly seamlessly with batching. We can tell the DataLoader class our batch size and it will return a random sample of instances of that size. The code you write for the forward function will also work with a batch with no modifications (most of the time). This behavior is even better for us because computers—especially GPUs—are often much faster at larger computations, so doing the forward/backward passes for an entire batch is often just as fast as doing them for a single instance.

Note: One caveat to things "just working" is that sometimes your forward-pass code will be set up so that it can't work with batching. The code hints and description in the notebook will hopefully help you avoid this, but we're also here to support you on Piazza.

In your implementation we recommend starting with these default parameter values:
• batch size = 16
• k = 50 (embedding size)
• η = 5e−5 (learning rate)
• window = ±2
• min_token_freq = 5
• epochs = 1
• optimizer = AdamW

You can experiment with other values to see how they affect your results. Your final submission should use a batch size > 1. For more details on the equations and details of word2vec, consult Rong's paper [Rong, 2014], especially Equations 59 and 61.

■ Problem 6. Modify the cell containing the training loop to complete the required PyTorch training process. The notebook describes all the steps in more detail.

■ Problem 7. Check that your model actually works. We recommend running your code on the wiki-bios.10k.txt file for one epoch. After this much data, your model should know enough about common words that the nearest neighbors (words with the most similar vectors) to words like "january" are month-related words. We've provided code at the end of the notebook to explore. Try a few examples and convince yourself that your model/code is working. Once you're finished here, you're not yet ready to run everything, but you're close!

4.6 Implement stop-word and rare-word removal

Using all the unique words in your source corpus is often not necessary, especially when considering words that convey very little semantic meaning like "the", "of", or "we". As a preprocessing step, it can be helpful to remove any instance of these so-called "stop words". Note that when you remove stop words, you should keep track of their position so that the context doesn't include words outside of the window. This means that for the sentence "my big cats of the kind that…" with a context window of ±2 around "cats", you would only have "my" and "big" as context words (since "of" and "the" get removed) and would not include "kind."

4.6.1 Minimum frequency threshold

In addition to removing words that are so frequent that they have little semantic value for comparison purposes, it is also often a good idea to remove words that are so infrequent that they are likely very unusual words or words that don't occur often enough to get sufficient training during SGD. While the minimum frequency can vary depending on your source corpus and requirements, we will set min_count = 5 as the default in this assignment. Instead of just removing words that have fewer than min_count occurrences, we will replace them all with a unique <UNK> token. In the training phase, you will skip over any input (target) word that is <UNK>, but you will still keep these as possible context words.

■ Problem 8. Modify the function load_data to convert all words with fewer than min_count occurrences into <UNK> tokens. Modify your dataset generation code to avoid creating a training instance when the target word is <UNK>.
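A small, hedged sketch of this rare-word handling, continuing the illustrative load_data logic from earlier (the provided skeleton will organize this differently):

from collections import Counter

min_count = 5
tokens = open("wiki-bios.10k.txt", encoding="utf-8").read().lower().split()
counts = Counter(tokens)
# Replace rare words with a single shared <UNK> token before building IDs
tokens = [w if counts[w] >= min_count else "<UNK>" for w in tokens]
# When generating training pairs later, skip any instance whose *target* is <UNK>,
# but still allow <UNK> to appear as a context word.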
4.6.2 Frequent word subsampling

Words appear with varying frequencies: some words like "the" are very common, whereas others are quite rare. In the current setup, most of our positive training examples will be for predicting very common words as context words. These examples don't add much to learning since they appear in so many contexts. The word2vec library offers an alternative to ensure that contexts are more likely to contain meaningful words. When creating the sequence of words for training (i.e., what goes in full_token_sequence_as_ids), the software randomly drops words based on their frequency so that more common words are less likely to be included in the sequence. This subsampling effectively increases the context window too—because the context window is defined with respect to full_token_sequence_as_ids (not the original text), dropping a nearby common word means the context gets expanded to include the next-nearest word that was not dropped.

To determine whether a token in full_token_sequence_as_ids should be subsampled, the word2vec software uses this equation to compute the probability p_k(w_i) that a token for word w_i is kept for training:

p_k(w_i) = ( √( p(w_i) / 0.001 ) + 1 ) · ( 0.001 / p(w_i) )    (6)

where p(w_i) is the probability of the word appearing in the corpus initially. Using this probability, each occurrence of w_i in the sequence is randomly decided to be kept or removed based on p_k(w_i).

■ Problem 9. Modify the function load_data to compute the probability p_k(w_i) of being kept during subsampling for each word w_i.

■ Problem 10. Modify the function load_data so that after the initial full_token_sequence_as_ids is constructed, tokens are subsampled (i.e., removed) according to their probability of being kept, p_k(w_i).

4.7 Getting Tensorboard Running

As you might guess, training word2vec on a lot of data can take some time. This will be increasingly true as you train larger and larger models (not just word2vec). However, the larger PyTorch ecosystem provides some fantastic tools for you, the practitioner, to monitor training progress. In this subtask, you'll be using one of those tools, tensorboard, which lets you log how your model is doing; you can then connect to the tensorboard interface and see the plot. Figure 1 shows an example of the tensorboard plot for our reference implementation after one epoch of training. Here, we've just recorded a running sum of the loss every 100 steps. You will want to do the same. This will help you see how quickly your model is converging. If you train multiple models, tensorboard will show all of their training plots so you can see how your choice of hyperparameters affects training speed and which model has learned the most (has the lowest loss). In practice, many people use tensorboard to determine when to stop training, after seeing that their model has effectively converged.

[Figure 1: An example tensorboard run from the reference solution where the running sum of the loss is reported every 100 steps (i.e., the sum of those steps' losses) across one epoch on the training data. Hovering over any point shows the loss at that time, as well as the relative wall-clock time. As you can see, after one epoch the model has learned something but has probably not fully converged!]

■ Problem 11. Add tensorboard logging to your training loop so that you keep track of the sum of the losses for the past 100 steps and record the value with tensorboard.
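A hedged sketch of this logging, with placeholder names (dataloader, train_step) standing in for whatever your training loop already uses:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()            # logs to ./runs/ by default
running_loss = 0.0
for step, batch in enumerate(dataloader):
    loss = train_step(batch)        # your forward/backward/optimizer code
    running_loss += loss.item()
    if (step + 1) % 100 == 0:
        writer.add_scalar("training/loss_sum_last_100_steps", running_loss, step)
        running_loss = 0.0
writer.close()

You can then launch the dashboard with tensorboard --logdir runs and watch the curve update while the model trains.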
4.8 Train Your Final Model

All the pieces are now in place and you can verify that the model has learned something. For your final vectors, we'll have you train on at least one epoch. Before you do that, we'll have you do one quick exploration to see how batch size impacts training speed.

■ Problem 12. Try batch sizes of 2, 8, 32, 64, 128, 256, and 512 to see how fast each step (one batch worth of updates) is and what the total estimated time is. For this, set the parameter and then run the training long enough to get an estimate of both, with tqdm wrapped around your batch iterator. You do not need to finish training the full epoch. Make a plot where batch size is on the x-axis and the tqdm-estimated time to finish one epoch is on the y-axis. (You may want to log-scale one or both of the axes.) You can try other batch sizes too in this plot if you're curious. In your write-up, describe what you see. What batch size would you choose to maximize speed? Side note: You might also want to watch your memory usage, as larger batches can sometimes dramatically increase memory consumption.

■ Problem 13. Train your model on at least one epoch's worth of data. You are welcome to change the hyperparameters as you see fit for your final model (although batch size must be > 1). Record the full training process and save a picture of the tensorboard plot from your training run in your report. We need to see the plot. It will probably look something like Figure 1.

5 Task 2: Save Your Outputs

Once you've finished training the model for at least one epoch, save your vector outputs. The rest of the homework will use these vectors, so you don't even have to re-run the learning code (until the very last part, but ignore that for now). Task 2 is here just so that you have an explicit reminder to save your vectors. We've provided a function to do this for you.

6 Task 3: Qualitative Evaluation of Word Similarities

Once you've learned the word2vec embeddings from how a word is used in context, we can now use them! How can we tell whether what the model has learned is useful? As part of training, we put in place code that shows the nearest neighbors, which is often a good indication of whether words that we think are similar end up getting similar representations. However, it's often better to get a more quantitative estimate of similarity. In Task 3, we'll begin evaluating the model by hand by looking at which words are most similar to another word based on their vectors. Here, we'll compare words using the cosine similarity between their vectors. Cosine similarity measures the angle between two vectors, and in our case, words that have similar vectors end up having similar (or at least related) meanings.

■ Problem 14. Load the model (vectors) you saved in Task 2 using the Jupyter notebook provided (or code that does something similar), which uses the Gensim package to read the vectors. Gensim has a number of useful utilities for working with pretrained vectors.

■ Problem 15. Pick 10 target words and compute the most similar words for each using Gensim's function. Qualitatively looking at the most similar words for each target word, do these predicted words seem to be semantically similar to the target word? Describe what you see in 2-3 sentences. Hint: For maximum effect, try picking words across a range of frequencies (common, occasional, and rare words).
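A hedged sketch of the Gensim workflow for Problems 14 and 15; the saved-vector filename is a placeholder for whatever the provided saving function produces:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("my_word2vec_vectors.txt", binary=False)

for word in ["january", "chemistry", "good"]:      # pick your own 10 target words
    print(word, vectors.most_similar(word, topn=10))

# Analogies (Problem 16) can use the same API, e.g.:
# vectors.most_similar(positive=["king", "woman"], negative=["man"])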
■ Problem 16. Given the analogy function, find five interesting word analogies with your word2vec model. For example, when representing each word by its word vector, we can write the equation king – man + woman = queen. In other words, you can understand the equation as queen – woman = king – man, which means the vector relationship between "queen" and "woman" is the same as that between "king" and "man". What kinds of other analogies can you find? (NOTE: Any analogies shown in the class recording cannot be used for this problem.) What approaches worked and what approaches didn't? Write 2-3 sentences in a cell in the notebook.

7 Task 4: Debiasing word2vec

Once you have completed all the other steps, and only then, start on Task 4! Methods that learn word meaning from observing how words are used in practice are known to pick up on latent biases in the data. As a result, these biases persist in the vectors and in any downstream applications that use them—something we don't want if, for example, the vectors are used in an NLP program screening resumes to decide whom to interview. In Task 4, we'll try our hand at preventing these biases by modifying the training procedure. You won't need to completely eliminate bias by any means, but the act of trying to reduce the biases will open up a whole new toolbox for how you (the experimenter/practitioner) can change how and what models learn.

There are many potential ways to debias word embeddings so that their representations are not skewed along one "latent dimension" like gender. In Task 4, you'll try to remove gender bias! The point of this part of the assignment is to have you start grappling with a hard challenge; there is no penalty for doing less well!

One common technique to have models avoid learning bias is similar to one you have already seen—regularization. In Logistic Regression, we could use L2 regularization to have our model avoid learning β weights that are overfitted to specific or low-frequency features by adding a regularizer penalty: the larger the weight, the more penalty the model pays in its loss (remember that the model parameters are trying to lower this loss). Recall that this forces the model to pick only the most useful (generalizable) weights, since it has to pay a penalty for any non-zero weight. In word2vec, we can adapt this idea to think about whether our model's embeddings are closer to or farther from different gender dimensions. For example, if we consider the embedding for "president", ideally we'd want it to be equally similar to the embeddings for "man" and "woman". One idea, then, is to penalize the model based on how uneven this similarity is. We can do this by directly modifying the loss:

loss = loss_criterion(preds, actual_vals) + some_bias_measuring_function(model)

Here, the some_bias_measuring_function function takes your model as input and returns how much bias it found. The example code provides a few ideas, like:

1. penalizing the model for having dissimilar vectors for words like "man" and "woman" (i.e., those two vectors should have high cosine similarity), or
2. penalizing the model if some word like "president" is more similar to the vector for either "man" or "woman" (i.e., its vector should be equally similar to both).

We can easily add in these kinds of terms because of how PyTorch tracks the gradient (compare that with how you might have done this in Homework 1!). Given the current loss for the context word prediction, we can add the penalty to this loss so that our word2vec model (1) learns to predict the right context words while (2) avoiding learning biases. As a result, backpropagation will update the embedding weights to reduce the loss with respect to both our context word prediction and the bias penalty. Wow!
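As one illustration of idea (2), here is a hedged sketch of a possible some_bias_measuring_function. The probe words, the penalty weight, and the model fields (a target_embeddings Embedding plus a word_to_id mapping) are all assumptions for illustration, not part of the provided code:

import torch
import torch.nn.functional as F

def some_bias_measuring_function(model, word_to_id, weight=0.1):
    def vec(w):
        return model.target_embeddings(torch.tensor(word_to_id[w]))

    penalty = torch.tensor(0.0)
    for word in ["president", "doctor", "nurse", "engineer"]:   # illustrative probe words
        sim_man = F.cosine_similarity(vec(word), vec("man"), dim=0)
        sim_woman = F.cosine_similarity(vec(word), vec("woman"), dim=0)
        penalty = penalty + (sim_man - sim_woman) ** 2          # zero when equally similar
    return weight * penalty

# In the training loop:
# loss = loss_criterion(preds, actual_vals) + some_bias_measuring_function(model, word_to_id)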
There are many possible extensions and modifications to how you compute the penalty for bias. For example:

• Why just "president"? Maybe add more words.
• Why just "man" and "woman"? Why not other words, other gender-related words, or other gender identities?
• Could you use both ideas above at once?
• Could you force the gender information into a specific embedding dimension and then drop it? (Kind of like how we dropped dimensions with the SVD.)
• Could you train a model to predict gender from embeddings and then penalize the model based on how well it does?
• How often do you need to apply these penalties when learning? Every step? Every few steps? (How much do they slow the training down?)
• How much should you weight the penalty compared with the penalty for wrong context-word predictions?

Your task is to build upon the very simple ideas here and in the code to define some new "bias penalty". You'll add this penalty value to the loss value that you calculated from your loss function. There is no right way to do this and even some right-looking approaches may not work—or might work but simultaneously destroy the information in the word vectors (all-zero vectors are unbiased but also uninformative!). Once you have generated your model, record word vector similarities for the pairs listed on Canvas in word-pair-similarity-predictions.csv using the cosine similarity. You'll upload the file to CodaLab (details on Piazza), which is kind of like Kaggle but lets us use a custom scoring program to measure bias.

We'll evaluate your embedding similarities based on how unbiased they are and how much information they still capture after debiasing. Your grade does not depend on how well you do on CodaLab, just that you tried something and submitted it. However, the CodaLab leaderboard will hopefully provide a fun and insightful way of comparing just how much bias we can remove from our embeddings. You are welcome to keep trying new ideas and submit them to see just how much bias you can remove! We're looking forward to seeing what the most successful approach is!

7.1 CodaLab Submission

To make a submission to CodaLab, go to https://codalab.lisn.fr/ and register a new account. Then search for the competition "SI630 W22 Homework 2" and apply to join. You can make submissions under Participate – Submit/View Results. You need to submit a zip file containing only the word-pair-similarity-predictions.csv file.

8 Optional Tasks

Word2vec has spawned many different extensions and variants. For anyone who wants to dig into the model more, we've included a few optional tasks here. Before attempting any of these tasks, please finish the rest of the homework and then save your code in a separate file so that if anything goes wrong, you can still get full credit for your work. These optional tasks are intended entirely for educational purposes and no extra credit will be awarded for doing them.

8.1 Optional Task 1: Modeling Multi-word Expressions

In your implementation, word2vec simply iterates over each token one at a time. However, words can sometimes be part of phrases whose meaning isn't conveyed by the words individually. For example, "White House" is a specific concept, which in NLP is an example of what's called a multi-word expression (https://en.wikipedia.org/wiki/Multiword_expression). In our particular data, there are lots of multi-word expressions. Since these are biographies, a lot of people are born in the United States, which ends up being modeled as "united" and "states"—not ideal! We'll give you two ideas.
In Option 1 of Optional Task 1, we've provided a list of common multi-word expressions in our data on Canvas (common-mwes.txt). Update your program to read these in and, during the load_data function, use them to group multi-word expressions into a single token. You're free to do this however you want, recognizing that not all instances of a multi-word expression are actually a single token, e.g., "We were united states the leader." This option is actually fairly easy and a fun way to get multi-word expressions to show up in the analogies too, which leads to lots of fun analogies involving people.

Option 2 is a bit more challenging. Mikolov et al. describe a way to automatically find these phrases as a preprocessing step to word2vec so that they get their own word vectors. In this option, you will implement their phrase detection as described in the "Learning Phrases" section of Mikolov et al. [2013].13

12 https://en.wikipedia.org/wiki/Multiword_expression
13 http://arxiv.org/pdf/1310.4546.pdf

8.2 Optional Task 2: Better UNKing

Your current code treats all low-frequency words the same by replacing them with an <UNK> token. However, many of these words could be collapsed into specific types of unknown tokens based on their prefixes (e.g., "anti" or "pre") or suffixes (e.g., "ly" or "ness") or even the fact that they are numbers or all capital letters! Knowing something about the context in which words occur can still potentially improve your vectors. In Optional Task 2, try modifying the code that replaces a token with <UNK> so that it uses something of your own creation.

8.3 Optional Task 3: Some performance tricks for Word2Vec

Word2vec, and deep learning in general, has many performance tricks you can try to improve how the model learns in both speed and quality. For Optional Task 3, you can try two tricks:
• Dropout: One useful and deceptively-simple trick is known as dropout. The idea is that during training, you randomly set some of the inputs to zero. This forces the model to not rely on any one specific neuron in making its predictions. There are many good theoretical reasons for doing this [e.g., Baldi and Sadowski, 2013]. To try this trick out, during training (and only then!), when making a prediction, randomly choose a small percentage (10%) of the total dimensions (e.g., 5 of the 50 dimensions of your embeddings) and set these to zero before computing anything for predictions.
• Learning Rate Decay: The current model uses the same learning rate for all steps. Yet, as we learn, the vectors are hopefully getting better at approximating the task. As a result, we might want to make smaller changes to the vectors as time goes on to keep them close to the values that are producing good results. This idea is formalized in a trick known as learning rate decay, where, as training continues, you gradually lower the learning rate in the hope that the model converges better to a local minimum. There are many (many) approaches to this trick, but as an initial idea, try setting a lower bound on the learning rate (which could be zero!) and linearly decrease the learning rate with each step (a small sketch of this follows). You might even start doing this only after the first epoch. If you want to get fancier, you can try to only start decreasing the learning rate once the change in log-likelihood becomes small, which signals that the model is converging but could still potentially be fine-tuned a bit more.
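As a concrete illustration of the learning-rate-decay idea, here is a minimal sketch of a linear schedule; the starting rate, floor, and total number of steps are made-up values that you would tune yourself.

    def decayed_learning_rate(step, total_steps,
                              initial_lr=0.05, min_lr=0.0001):
        """Linearly interpolate from initial_lr down to min_lr over training.

        `step` is the current training step (0-based); after `total_steps`
        the rate simply stays at `min_lr`.
        """
        fraction_done = min(step / total_steps, 1.0)
        return initial_lr + (min_lr - initial_lr) * fraction_done

    # sketch of use inside a training loop:
    # for step in range(total_steps):
    #     lr = decayed_learning_rate(step, total_steps)
    #     ...use lr when updating the embeddings for this step...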
8.4 Optional Task 4: Incorporating Synonyms

As a software library, word2vec offers a powerful and extensible approach to learning word meaning. Many follow-up approaches have extended this core approach by (1) adding more information about the context with additional parameters or (2) modifying the task so that the model learns multiple things. In Optional Task 4 you'll try one easy extension: using external knowledge! Even though word2vec learns word meaning from scratch, we still have quite a few resources around that can tell us about words. One of those resources, which we'll talk much more about in Week 12 (Semantics), is WordNet, which encodes word meanings in a large knowledge base. In particular, WordNet contains information on which word meanings are synonymous. For example, "couch" and "sofa" have meanings that are synonymous.

In Optional Task 4, we've provided you with a set of synonyms (synonyms.txt) that you'll use during training to encourage word2vec to learn similar vectors for words with synonymous meanings. How will we make use of this extra knowledge of which words should have similar vectors? There have been many approaches to modifying word2vec, some of which are in your weekly readings for the word vector week. However, we'll take a simple approach: during training, if you encounter a token that has one or more synonyms, replace that token with a token sampled from among the synonymous tokens (which includes the token itself). For example, if "state" and "province" are synonyms, when you encounter the token "state" during training, you would randomly swap that token for one sampled from the set ("state", "province"). Sometimes this keeps the same token, but other times you force the model to use the synonym—which requires using that synonym's embedding to predict the context for the original token. Given enough epochs, you may even end up predicting the same context for each of the synonyms. One advantage of this approach is that if a synonymous word shows up much less frequently (e.g., "storm" vs. "tempest"), the random swapping may increase the frequency of the rare word and let you learn a similar vector for both.

Update the main function to take in an optional file with a list of synonyms. Update the train function (as you see fit) so that if synonyms are provided, during training, tokens with synonyms are recognized and randomly replaced with a token from the set of synonyms (which includes the original token too!); a small sketch of this swapping appears at the end of this task. Train the synonym-aware model for the same number of epochs as you used to solve Task 3 and save this model to file.

As you might notice, the synonyms.txt file has synonyms that only make sense in some contexts! Many words have multiple meanings, and not all of these meanings are equally common. Moreover, a word can have two parts of speech (e.g., be a noun and a verb), which word2vec is unaware of when modeling meaning. As a result, the word vectors you learn are effectively trying to represent all the meanings for a word in a single vector—a tough challenge! The synonyms we've provided are an initial effort at identifying common synonyms, yet even these may shift the word vectors in unintended ways. In the next problem, you'll assess whether your changes have improved the quality of the models. Load both the original (not-synonym-aware) model and your new model into a notebook and examine the nearest neighbors of some of the same words. For some of the words in the synonyms.txt file, which vector space learns word vectors that have more reasonable nearest neighbors? Does the new model produce better vectors, in your opinion? Show at least five examples of nearest neighbors that you think help make your case and write at least two sentences describing why you think one model is better than the other.
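One minimal sketch of the random swapping described above, assuming you have already read synonyms.txt into a dictionary mapping each token to the full list of its synonyms (including the token itself); the names here are placeholders for your own code.

    import random

    def maybe_swap_synonym(token, synonym_sets):
        """Return the token itself or one of its synonyms, chosen uniformly
        at random. `synonym_sets` maps a token to a list of synonymous
        tokens that already includes the original token."""
        if token in synonym_sets:
            return random.choice(synonym_sets[token])
        return token

    # sketch of use during training:
    # synonym_sets = {"state": ["state", "province"],
    #                 "province": ["state", "province"]}
    # center_word = maybe_swap_synonym(center_word, synonym_sets)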
9 Hints

1. Start early; this homework will take time to debug and you'll need time to wait for the model to train for Task 1 before moving on to Tasks 2-4, which use its output or modify the core code.
2. Run on a small amount of data at first to get things debugged.
3. The total time for load_data on the reference implementation loading the normal training data is under 20 seconds.
4. The training data generating code takes around 10-15 minutes on the wiki-bios.med.txt file.
5. Average time per epoch on our tests was ∼ one hour for the wiki-bios.med.txt file. If you are experiencing times much greater than that, it's likely due to a performance bug somewhere. We recommend checking for extra loops somewhere.
6. If your main computer is a tablet, please consider using Great Lakes for training the final models (you can develop/debug locally though!).

10 Submission

Please upload the following to Canvas by the deadline:
1. your code for word2vec
2. your Jupyter notebook for Parts 3 and 4
3. a PDF copy of your notebook (which has the written answers) that also includes what your username is on CodaLab.
Please upload your code and response document separately. We reserve the right to run any code you submit; code that does not run or produces substantially different outputs will receive a zero. In addition, you should upload your debiased model's similarity scores to the CodaLab site.

References

Pierre Baldi and Peter J. Sadowski. Understanding dropout. Advances in Neural Information Processing Systems, 26:2814–2822, 2013.
Yoav Goldberg and Omer Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.
Homework 1 will have you build your first NLP classifier, Logistic Regression. Logistic Regression is a simple, yet often surprisingly effective, model for text classification—you may have even used Logistic Regression in one of the common Python machine learning libraries like sklearn!1 In Homework 1, you'll build Logistic Regression from scratch to get a sense of how these models work. Your implementation will be very simple (and slow) but will give you much more intuition for how machine learning models work and, critically, how deep learning models work.

1 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.

Homework 1 will have you implement Logistic Regression in two ways. In the first, you'll use numpy to implement all parts of Logistic Regression, including the stochastic gradient descent and update steps. In the second part, you'll implement the same approach using pytorch, a modern deep learning library. For working with numeric data (e.g., vectors of numbers), pytorch and numpy are very similar.2 However, pytorch is designed for training deep learning models and can automate much or all of the updating steps. We are having you implement Logistic Regression twice using different libraries to give you more intuition on what pytorch is doing (and how stochastic gradient descent works!) while also getting you up to speed on how to use pytorch, which you'll need for all later assignments.

The programming part of this assignment will be moderately challenging due to (1) needing to map concepts and equations to Python code and (2) needing to learn how to use two new libraries. In practice, the code for logistic regression is about 20-30 lines (or less), with both the numpy and pytorch models sharing some additional code for simple I/O and setting up the data. The big challenge will come from trying to ground the mathematical concepts we talk about in class in actual code, e.g., "what does it mean to calculate the likelihood???". Overcoming this hurdle will take a bit of head scratching and possibly some debugging, but it will be worth it, as you will be well on your way to understanding how newer algorithms work and even how to design your own!

There are equations in the homework. You should not be afraid of them—even if you've not taken a math class since high school. Equations are there to provide precise notation and, wherever possible, we've tried to explain them further. There are also many descriptions in the lecture materials and the Speech & Language Processing textbook that can help explain these more if you need them. Logistic Regression is a very common technique, so you will also find good explanations online if you need them (e.g., the visualizations in this video, which can give you a sense of what the bias term is doing). Getting used to equations will help you express yourself in the future. If you're not sure what something means, please ask; there are six office hours between now and when this assignment is due, plus an active Piazza discussion board. You're not alone in wondering, and someone will likely benefit from your courage in asking. Given that these are fundamental algorithms for NLP and many other fields, you are likely to run into implementations of Logistic Regression online. While you're allowed to look at them (sometimes they can be informative!),3 all work you submit should be your own (see Section 9). Finally, remember this is a no-busywork class. If you think some part of this assignment is unnecessary or too much effort, let us know.
We are happy to provide more detailed explanations for why each part of this assignment is useful. Tasks that take inordinate amounts of time could even be a bug in the homework and will be fixed.

The learning goals for this assignment are as follows:
• Understand how Logistic Regression works, what its parameters do, and how to make predictions
• Learn how to work with sparse data structures
• Learn how gradient descent and stochastic gradient descent work in practice
• Understand how to implement classifiers using numpy and, more generally, how to work with new Python libraries
• Learn how to implement and train a (shallow) neural network with pytorch
• Improve your software development and debugging skills4

The 2020 election season in the United States brought in record amounts of funding and voter turnout. A key part of that was the use of electronic messaging to encourage donations and voting. Through some very neat data collection, our colleagues at Princeton have collected ∼435K emails sent from candidates running for state and federal office, political parties, and other political organizations like Political Action Committees (PACs) (Mathur et al., 2020).5 These emails contain a surprising and fascinating amount of diversity and tactics for getting people to contribute—including "dark patterns" like having the email appear to be sent from a different sender. If you're interested, you can read more in their paper describing the data and some analyses of it (it's a great read).

3 The instructors would also like to caution that not all implementations found online are correct, nor are all the implementations of a particular algorithm actually what is being asked for in this homework. Some students in prior semesters have found themselves more confused after reading many different descriptions/implementations of the algorithms and only gained clarity after trying to walk through the material themselves. As always, your instructors are here to help if you have questions!
4 Seriously! Advanced software development will have you finding all sorts of weird bugs that will take some leaps to figure out. Be prepared, but know that the skills you learn here will greatly help you in your career. Learning how to debug code using unfamiliar libraries and do this development is a key goal of this assignment!
5 https://electionemails2020.org/

Political actors frequently use framing devices and rhetorical strategies to motivate their bases. These frames might look like describing the need to donate as a moral imperative or warning that the person's health or safety is at risk if the emailing candidate loses. Different political parties will employ a common set of strategies that they know resonate with their base, which suggests we should see some regularity in the messaging, despite the emails being sent by thousands of candidates or organizations. Your task is to build a classifier to assess to what degree we can predict the political party of the sender from this type of messaging.

Of course, these emails contain the candidate's name, which would make the classification task very easy if done from the raw text—the model would just learn which candidate was from which party! To focus particularly on framing, we've removed references to all Named Entities (people, places, organizations) from the emails for you, as well as any mailing addresses.
While some noise does exist (NLP isn't perfect, after all), the data should better capture the message we want to analyze using a classifier.

Important Side Note on the Data: This data is provided by the folks at Princeton with some restrictions. The most important is that you will not republish the underlying data, so this data should never be put on your website, left on a public URL, or pushed to a public git repo (for example). You can view the full set of conditions and terms here.

3 Homework Design Notes

This homework is divided into three parts. Part 1 has you implementing the common pre-processing steps that turn the text into a term-document matrix. You'll need to write code to read in the data, tokenize it, and convert it to some numeric representation. Part 2 has you implement Logistic Regression using numpy, including all the details of the stochastic gradient descent. Part 3 has you implement Logistic Regression using pytorch. In Part 3, you'll also do a few experiments using development data to see the effect of tuning certain hyperparameters.

Notation Notes: Throughout the assignment description we'll refer to classifier features as x_1, ..., x_n ∈ X, where x_i is a single feature and X is the set of all features. When starting out, each feature will correspond to the presence of a word; however, you are free in later parts of the homework to experiment with different kinds of features, like bigrams that denote two consecutive words, e.g., "not good" is a single feature. We refer to the class labels as y_1, ..., y_n ∈ Y, where y_i is a single class and Y is the set of all classes. In our case, we have a binary classification task, so there's really only y_1 and y_2. When you see a phrase like P(X = x_i | Y = y_j), you can read this as "the probability that we observe that the feature (X) x_i is true, given that we have seen that the class (Y) is y_j". We'll also use the notation exp(x_i) to refer to e^{x_i} at times. This notation lets us avoid superscripts when the font might become too small or when it makes equations harder to follow.

Implementation Note: Unless otherwise specified, your implementations should not use any existing off-the-shelf machine learning libraries or methods (i.e., no using sklearn). You'll be using plain old Python along with numeric libraries like numpy and pytorch to accomplish everything.

Homework 1 has three associated files (you can find them on our Kaggle competition associated with this homework):
• train.csv This file contains the emails, their IDs, and the labels you will use to train your classifiers. The party affiliation column contains the political party of the sender of each email.
• dev.csv This file contains the emails and their IDs for which you will make political party predictions using your trained classifiers and upload the results to Kaggle.
• test.csv This file contains the emails and their IDs for which you will make political party predictions using your trained classifiers and upload the results to Kaggle.

As a part of this assignment, we'll be using Kaggle in the Classroom to report predictions on the test data. This homework is not a "leaderboard competition," per se. Instead, we're using Kaggle so you can get a sense of how good your implementation's predictions are relative to those of other students. Since we're all using the same data and implementing the same algorithms, your scores should be relatively close to other students'.
If you decide to take up the optional part and do some feature engineering, you might have slightly higher scores, but no one will be penalized for this. We've set up a Kaggle competition for both the bare-bones Logistic Regression and the Logistic Regression with PyTorch. You'll first need to sign up using the invitation link and then submit your scores to the competition.
• LR competition link: https://www.kaggle.com/c/umsi630w22hw1-1
• LR invitation link: https://www.kaggle.com/t/5cac649ea0614941bc338ef7ec01dadf
• PyTorch LR competition link: https://www.kaggle.com/c/umsi630w22hw1-2
• PyTorch LR invitation link: https://www.kaggle.com/t/ac95c0cf9f5b4443a30a06fea0715

5 Part 1: Representing Text Data

Before we can start classifying, we need to turn our text data into some numeric representation that classifiers can work with. In class, we talked about using a bag of words representation—i.e., each document is represented as a vector x of length |V| (the number of unique words across all our documents), and we count how many times each word occurs in that document so that x_i is the number of times the word represented by column i occurs. Part 1 is divided into two tasks: (1) tokenizing documents into words and (2) turning those lists of words into vectors.

5.1 Task 1.1: Tokenization

There are many, many ways to tokenize text. We'll try two in this homework to give you a sense of how you might do it. There are few wrong ways to tokenize, though some approaches will give you better results both in terms of token quality and downstream performance when you train your models.
• Write a function called tokenize that takes in a string and tokenizes it by whitespace, returning a list of tokens. You should use this function as you read in the training data so that each whitespace-separated word will be considered a different feature (i.e., a different x_i). Notice that we are probably doing a bad job of tokenizing due to punctuation. For example, "good" and "good," are treated as different features because the latter has a comma at the end, and "Good" and "good" are also different features because of capitalization. Do we want these to be different? Furthermore, do we want to include every token as a feature? (Hint: could a regular expression help you filter possible features?) Note that you have to implement the tokenization yourself; no importing NLTK/SpaCy and wrapping their functions (though you might take some inspiration from them).
• Write a better tokenizing function called better_tokenize that fixes these issues. In your report, describe which kinds of errors you fixed, what kind of features you included, and why you made the choices you did. We'll use these functions later when we evaluate the models.

5.2 Task 1.2: Building the Term-Document Matrix

As a thought experiment, consider what might happen if we make a term-document matrix for our training data. This matrix would be D × V in size. Since our vocabulary likely gets larger with each new document, this matrix will be huge—potentially tens or hundreds of gigabytes—and won't fit on your personal computer. Yet libraries like sklearn work just fine, so what's going on? In practice, the term-document matrix is very sparse; most documents don't have most of the words, so we don't actually need to represent all these zeros! In your solution, we'll want to use a sparse matrix representation. Conveniently, the SciPy library has several sparse matrix implementations that we can use.6 When building your term-document matrix, you'll want to create one of these matrices. When you create the matrix, you'll need to keep track of which dimensions correspond to which words, so that when you see a new document, you'll know how to create a new vector with the correct dimensions filled in for its words.

6 https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/
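As a rough sketch of what this bookkeeping can look like (the function and variable names here are illustrative, not required), one common pattern builds a word-to-column mapping and a scipy.sparse CSR matrix:

    from collections import Counter
    from scipy.sparse import csr_matrix

    def build_term_document_matrix(tokenized_docs, min_freq=1):
        """tokenized_docs is a list of token lists (one per document).
        Returns (matrix, word_to_index) where matrix is a sparse
        document-by-word count matrix."""
        counts = Counter(tok for doc in tokenized_docs for tok in doc)
        vocab = [w for w, c in counts.items() if c >= min_freq]
        word_to_index = {w: i for i, w in enumerate(vocab)}

        rows, cols, vals = [], [], []
        for doc_id, doc in enumerate(tokenized_docs):
            doc_counts = Counter(t for t in doc if t in word_to_index)
            for word, count in doc_counts.items():
                rows.append(doc_id)
                cols.append(word_to_index[word])
                vals.append(count)
        matrix = csr_matrix((vals, (rows, cols)),
                            shape=(len(tokenized_docs), len(word_to_index)))
        return matrix, word_to_index

The same word_to_index mapping is then reused to vectorize the development and test documents so that their columns line up with the training matrix.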
How many words should you include in your vocabulary V? Including everything is tempting (it's easy, after all). However, let's think about what happens if we include all words. First, word frequency follows Zipf's Law, which, in practice, means that most words occur very rarely (roughly 80% of the unique words make up less than 20% of the tokens we see). As a result, most of the unique words will have very low frequency—with many occurring only once in our data. If a word occurs only a handful of times, it's not particularly useful to us for classification since (i) the presence of the word doesn't generalize to many documents and (ii) including the word increases the number of parameters we have to learn (without likely helping performance). As a result, most text classification systems will impose some minimum word frequency for including the word in the vocabulary. In your solution, you should use a minimum word frequency of 10; every word occurring fewer than 10 times will not be included in the vocabulary (and therefore doesn't get a dimension in the β vector).

Important Caveat: In building the matrix, we'll eventually need to account for the bias term in our numpy implementation. Frequently, practitioners will simply add a single column to the term-document matrix that always has the value 1, which will get multiplied by the bias coefficient. However, the pytorch library will handle the bias term for us (yay!), so we don't always need to add this column. You'll want to design your matrix-building code so that you can correctly account for the bias term. One hint is that, in building the matrix, you might find the set, defaultdict, and Counter classes in Python handy.

6 Part 2: Logistic Regression in numpy

In Part 2, you'll implement logistic regression, which you might recall is

$$P(y = 1 \mid x, \beta) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{N} x_i \beta_i\right)}$$

Note that our implementation of Logistic Regression is restricted to two classes, which we represent as binary so that y is either 0 or 1.7 Conveniently, your training data is already set up for binary classification. Your implementation will be one of the simplest possible formulations of logistic regression, where you use stochastic gradient descent to iteratively find better parameters β. Smarter solutions like those in sklearn use numerical solvers, which are much (much) faster. The purpose of the homework setup is to give you a sense of how to compute a gradient.

In Task 1, you'll implement each step of the Logistic Regression classifier in separate functions. As a part of this, you'll also write code to read in and tokenize the text data. Here, tokenization refers to separating words in a string. In the second half, you'll revisit tokenization to ask if you can do a better job at deciding what counts as a word.
• Implement a function called sigmoid that implements the sigmoid function

$$S(X) = \frac{1}{1 + \exp(-X)}$$

Your function should be vectorized so that it computes the sigmoid of a whole vector of numbers at once.
Conveniently, numpy will often do this for you, so that if you multiply a number by a numpy array, it will multiply each item in the array (the same applies to functions used on an array (hint)). If you're not sure, please ask us! You'll need to use the sigmoid function to make predictions later.

7 Important note: there is multiclass Logistic Regression, which the SLP3 textbook goes into detail about. This multiclass setup is more complicated (though also more useful), so we are not implementing it in this homework.

• Implement a function called log_likelihood that calculates the log likelihood of the training data given our parameters β. Note that we could calculate the raw likelihood, but we'll use the logarithmic version (hence the log-likelihood) to avoid numeric underflow and because it's faster to work with. We could calculate the log likelihood ll over the whole training data as

$$\ell\ell = \sum_{i=1,\ldots,n} y_i \beta^{T} x_i - \log\left(1 + \exp(\beta^{T} x_i)\right)$$

where β is the vector of all of our coefficients. It takes some steps to reach this formula; the details of how it is derived will be discussed in the discussion section. Feel free to skip the tedious math derivation and just use the formula above in your implementation! However, you'll be implementing stochastic gradient descent, where you update the weights after computing the loss for a single randomly-selected (stochastically!) item from the training set.
• Given some choice of β to make predictions, we want to use the difference between our predictions Ŷ and the ground truth Y to update β. The gradient of the log likelihood tells us which direction (positive or negative) to make the update in and how large the update should be. Write a function compute_gradient to compute the gradient. Note that if we were doing Gradient Descent, we could compute the whole gradient across the whole dataset using

$$\nabla \ell\ell = X^{T}(Y - \hat{Y})$$

where Y is a binary vector with our ground truth (i.e., the training data labels) and Ŷ is the binary vector with the predictions. However, we're going to compute the gradient of the likelihood one instance at a time, so you can calculate the gradient for how to update β as

$$\frac{\partial}{\partial \beta} \ell\ell(\beta) = \left(\sigma(\beta x_i) - y_i\right) x_i$$

This equation tells us how far our prediction σ(βx_i) is from the ground truth y_i and then uses that difference to scale the feature values of the current instance x_i to determine how to update β. In essence, this update tells us how to change β so that the model learns which features to weight more heavily and how to weight them (or, conversely, which features to ignore!). When learning, we'll scale this gradient by the learning rate α to determine how much to update β at each step:

$$\beta_{\text{new}} = \beta_{\text{old}} - \alpha \frac{\partial}{\partial \beta} \ell\ell(\beta)$$

To get a sense of why this works, think about what the gradient will equal if our prediction for item i, ŷ_i, is the same as the ground truth y_i; if we use this gradient to update our weight for β_i, what effect will it have?

Important Tip: The SLP3 book also has a very nice walk-through of logistic regression and stochastic gradient descent in Chapter 5, Section 5.4, which includes a detailed example of the update steps in 5.4.3.
If anything in the homework description is confusing, please feel free to reach out to us on Piazza and/or check out the SLP3 material!8
• Putting it all together, write a function logistic_regression that takes in:
– a matrix X where each row is a vector that has the features for that instance,
– a vector Y containing the class of each row,
– learning_rate, a parameter to control how much you change the β values at each step,
– num_steps, how many steps to update β for before stopping.
Your function should iteratively update the weight vector β at each step by making a prediction, ŷ_i, for a row i of X and then using that prediction to calculate the gradient. You should also include an intercept coefficient.9 Note that you can make your life easier by using matrix operations. For example, to compute ŷ_i, multiply the row of matrix X by the β vector. If you're not sure how to do this, don't worry! Please come talk to us during office hours! (A small numpy sketch of this loop appears at the end of this part.)
• Write a function predict that, given some new text, converts it to a vector (i.e., something like a row from X) and then uses the β vector to predict the class and returns the class label.
• To see how your model is learning, first train your model on the training data with learning_rate=5e-5 (i.e., a very small number) and just use num_steps=1000. Note: This means training only on 1000 emails; if your function takes a while, you might have implemented full Gradient Descent, which isn't asked for. Make a plot of the log-likelihood for the full data every 100 steps for this initial pass. Did the model converge at some point (i.e., does the log likelihood remain stable)?
• Now that you have some sense of how your hyperparameters work, train the model until it converges. This will likely take several epochs, depending on your learning rate. You can stop when the log-likelihood does not change too much between steps. Typically a very small number is used (e.g., 1e-5), though you are welcome to define your own.
• After training on the training data, use your logistic regression classifier to make predictions on the validation dataset and report your performance using the F1 score.
• Submit your best model's predictions on the test data to the KaggleInClass competition for numpy Logistic Regression.

8 We also want to add that, despite the fancy math, your actual implementation will mostly consist of very common operations!
9 The easiest way to do this is to add an extra feature (i.e., column) with value 1 to X; the numpy functions np.ones and np.hstack with axis=1 will help.

A few more hints:
• Be sure you're not implementing full gradient descent where you compute the gradient with respect to all the items. Stochastic gradient descent uses one instance at a time.
• If you're running into issues of memory and crashing, try ensuring you're multiplying sparse matrices. The dense term-document matrix is likely too big to fit into memory.
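To tie the pieces above together, here is a minimal numpy sketch of the stochastic-gradient-descent loop (dense arrays, made-up function names, and no intercept handling; your own solution will differ in the details and should work with sparse rows):

    import numpy as np

    def sigmoid(x):
        """Vectorized sigmoid: works on scalars and numpy arrays alike."""
        return 1.0 / (1.0 + np.exp(-x))

    def logistic_regression(X, Y, learning_rate=5e-5, num_steps=1000, seed=0):
        """Learn beta with stochastic gradient descent, one random row at a time."""
        rng = np.random.default_rng(seed)
        beta = np.zeros(X.shape[1])
        for _ in range(num_steps):
            i = rng.integers(len(Y))           # pick one training instance
            x_i, y_i = X[i], Y[i]
            prediction = sigmoid(x_i @ beta)   # sigma(beta . x_i)
            gradient = (prediction - y_i) * x_i
            beta -= learning_rate * gradient   # SGD update
        return beta

With a scipy sparse matrix you would index a row and take its dot product with β instead, but the update itself is the same.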
7 Part 3: Logistic Regression with PyTorch

You've come through a long journey to finally arrive at this exciting PyTorch part! PyTorch is a very popular deep learning framework in both academia and industry and is the library we'll use for this and all future NLP assignments. With this framework, one can build deep learning models with flexible architectures in a clean and concise manner. It may be a little challenging at the beginning, but once you've got the idea, you will be amazed to find that PyTorch is so powerful that it can enable you to build complex models within a few lines of code. The goal of Part 3 is to expose you to the "workflow" of how PyTorch trains models (since we'll be using this framework throughout the whole course) while also giving you a reference comparison point in your numpy implementation. In Part 3, you will implement the logistic regression model with PyTorch. You'll compare your PyTorch version with your numpy version and see if one performs better than the other.

7.1 Setting up the Data

PyTorch works with tensors.10 A tensor is a fancy math term for a matrix with an arbitrary number of dimensions (e.g., a vector is a tensor with 1 dimension!), which is very similar to arrays in numpy. Conveniently, pytorch also provides sparse tensor abilities,11 which we'll need to use to represent our term-document matrix tensor (it's still a matrix though!).
• Write a to_sparse_tensor function that takes the sparse numpy (SciPy) representation of the term-document matrix you used earlier and converts it into sparse Tensor objects. You will use the transformed data in tensor format for training the model in the later part.

10 https://en.wikipedia.org/wiki/Tensor
11 https://pytorch.org/docs/stable/sparse.html

7.2 Building the Logistic Regression Neural Network

Logistic Regression can be considered a shallow neural network (|V| input neurons and 1 output neuron), so we'll use PyTorch's functionality for building neural networks to help you get started for the semester. PyTorch has a special way to define neural networks so that we can use its very convenient functionality for training them. To create a new network, you'll extend the nn.Module class,12 which is the base class for all networks. Extending this class lets us hook into PyTorch's deep learning training. In particular, the class will keep track of which parameters need updating during gradient descent if they are assigned as a field to the class—this means we won't have to calculate the loss like we did before or do the updating for stochastic gradient descent.

12 If you haven't seen class extension before, there are many good examples (like this one) online; the code for your subclass is not complex, so you don't need to understand it too well, provided you get the syntax right.

The nn.Module class defines the forward() function, which is the most important function for us and the one we need to override. The forward function goes from the inputs to the outputs (i.e., from our bag of words vector to the predicted party). We'll specify the inputs and then use the network layers in the method to produce our outputs. Important note: Since neural networks are (non-linear) functions in the mathematical sense, PyTorch has overloaded the function call operator () so that calling a network calls forward(). For example, the code model.forward(inputs) and model(inputs) are equivalent! Note that this convention applies to all network layers too; these layers take inputs and produce outputs, so we can call them like functions. You might also guess that there's a backward() function too—and you're right! That's the step that goes from the outputs back to the inputs and updates the weights along the way. However, we don't have to do anything here. That's the magic: PyTorch handles the backwards updates for us and will update the β weights in our Logistic Regression model automatically as long as we call the training loop code correctly.
• In this part, you'll write a LogisticRegression class which is an extension of nn.Module. Think about: What's the input and the output? What's the shape of the weights that need to be trained? What kind of layer(s) do we have in logistic regression?
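One possible shape for such a class is sketched below (the argument name num_features is made up, and your own class may organize things differently):

    import torch
    import torch.nn as nn

    class LogisticRegression(nn.Module):
        """Logistic regression as a one-layer network: |V| inputs -> 1 output."""

        def __init__(self, num_features):
            super().__init__()
            # a single linear layer holds the beta weights and the bias term
            self.linear = nn.Linear(num_features, 1)

        def forward(self, x):
            # sigmoid squashes the linear score into a probability of class 1
            return torch.sigmoid(self.linear(x)).squeeze(-1)

    # model = LogisticRegression(num_features=vocab_size)
    # prob = model(some_feature_tensor)   # same as model.forward(...)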
What’s the shape of weights need to be trained? What kind of layer(s) do we have in logistic regression?7.3 Setting up the Training After your model implementation, the next step is to train the network. In particular, you need to define the loss function and optimizer and then use your prepared data to train the model. These are specific kinds of objects for pytorch but we’ve already seen examples of these concepts. The loss function tells pytorch how to measure how close our predictions are to the correct outputs. In our setting, we have a binary prediction task, so we’ll use Binary Cross Entropy Loss or BCELoss13 .You actually already implemented this in numpy! Pytorch has many other loss functions that make it easy to switch to multiclass classification (1 of n outputs is true) or even multilabel classification (k of n outputs is true). An optimizer object tells pytorch how to update the model’s parameters based on how wrong the model’s predictions were. We’ve already seen one such optimizer: Stochastic Gradient Descent! In our case, the SGD update rules are quite simple for updating our β parameters—so simple that we could easily write them in numpy; however, for larger networks with multiple layers, the update process becomes very complicated—just think of the two layer network example from Week 2’s lecture! Part of the power of PyTorch comes from the optimizer being able to automatically figure out how to update the parameters without us having to specify the code to do so. Here, we’ll use the SGD14 optimizer to start with. However, there are many other optimizers to start with and you’ll try a few later in the assignment. All of the optimizers need to know which parameters they’re updating so as a part of instantiating these objects, you’ll need to pass in your model’s parameters as an argument, which is why we need to extend the nn.Module so that all of this process works.13https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html#torch.nn. BCELoss 14https://pytorch.org/docs/stable/generated/torch.optim.SGD.html 10 The training process works in two steps. First, you’ll run the forward pass of your code where you provide some input and the model makes a prediction. Then you’ll start the backwards part of the process to update the parameters to make better predictions. This is the backpropagation algorithm15 that calculates the updates based on how each weight in the network contributed to the ultimate prediction. The loss function object will calculate the loss and subsequent gradient of change for us. Then, the optimizer will update the parameters with respect to that gradient. There are many, many tutorials on how to do this training process, since every pytorch network does the same series of operations regardless of what the network is being trained to do. The pytorch documentation has one good example training process for a vision classifier and you can find others like this one that describe similar steps.PyTorch will automatically keep track of the gradients over multiple steps as long as we tell it to. When it’s time for evaluating, rather than training, we need to tell the model to stop remembering those gradients. 
• Instantiate a BCELoss object that you'll use as your loss function.
• Instantiate an SGD object as your optimizer.
• Write your training loop that iterates over the data set for the specified number of epochs and then, for each step in the epoch, randomly samples one instance (a row from the term-document tensor) and gets a prediction from your pytorch LogisticRegression network. This process will look very similar to how you did your core training loop in numpy, only you'll add in the pytorch-specific bits. Use the loss function to measure the loss and then the optimizer to perform the backpropagation step (using the backward() and step() functions).
• Write some code that, during the training process, will evaluate the current model on the development data. You'll want to do this evaluation only periodically (not every step), and the prediction code will be the same for the test data (though without the score). Remember the eval() call!

7.4 Training and Experiments

Now we've got the whole network and training procedure defined, so it's time to start training. Since training in pytorch should be a bit faster and we want you to explore the code a bit, you'll do a few experiments in pytorch to see the effect of different choices. For most of these mini-experiments, you'll be doing the same type of operation: varying one hyperparameter and then plotting the performance of the model at certain training steps. We recommend refactoring your code in such a way that it is easy to periodically call the evaluation step to get the model's loss and its performance on the development data. For plotting, we highly recommend seaborn, though any plots will do. While we recognize that some of these tasks can look repetitive, they are intentionally designed to give you a deeper intuition for how design and hyperparameter choices can ultimately impact performance. The trends you see here may vary on future models and data, but the intuition and ability to quickly test the impact of these choices will come in handy in future homeworks and beyond as a practitioner.
• Like you did for the numpy code, train your model for a total of 1000 steps (i.e., only showing it 1000 randomly sampled documents) and report the loss after every 100 steps. This should verify that the loss is going down.
• Once you're satisfied that the model is working, train your model for at least 5 epochs and compute (and save) both (1) the loss every 1000 steps and (2) the F1 score of the model on the development data.
• Let's see what the effects of adding regularization are. PyTorch builds in regularization through the weight_decay argument to most Optimizer classes (including the one you use, SGD). Let's see what the effect is of setting the L2 penalty to 0 (default), 0.001, and 0.1. For each L2 penalty value, train the model for 1 epoch total and plot the loss and F1 score on the development set every 1000 steps using one line for each L2 value (use separate plots for F1 vs. Loss). In a few sentences, describe what you see: What effect does L2 have on the convergence speed and overall model performance?
• PyTorch has more than just the SGD optimizer. Recall that SGD takes a step with respect to the gradient, using the learning rate to scale how big of a step to take. But what if we used more than just the gradient? For example, could we keep a bit of momentum to keep our gradient heading in the same direction?
Many new optimizers have been proposed, and PyTorch keeps implementations of the more successful ones. The big benefit of a better optimizer is that it helps the model learn faster. Since some of our big models may take hours to converge, having the optimizer reduce the training time to under an hour can be a huge time and environmental benefit. For this step, replace your SGD optimizer with two other common alternatives: RMSprop and AdamW. For each optimizer, train the model for 1 epoch total and plot the loss and F1 score on the development set every 1000 steps using one line per optimizer (use separate plots for F1 vs. Loss). In a few sentences, describe what you see: What effect does the choice of optimizer have on the convergence speed and overall model performance?
• We had two methods for tokenizing. Does one method perform better in practice here? Train your model in the basic setting (using SGD, no L2 penalty) for 1 epoch and plot the loss and F1 score on the development set every 1000 steps using one line per tokenizer (use separate plots for F1 vs. Loss). In a few sentences, describe what you see: what effect does tokenization have on the overall model performance?
• What effect does the learning rate have on our model's convergence? Using the basic setup with SGD, change the lr argument (i.e., the learning rate) to a much larger and a much smaller value, train for 1 epoch, and plot the loss and F1 score on the development set every 1000 steps using one line per lr value. Plot all three curves together and describe what you observe: what effect does the learning rate have on the model's convergence speed? You're welcome (encouraged, even!) to try additional learning rates. If your model converges quickly, you can also reduce the number of steps between evaluations for this question.
• Finally, use your best model to generate prediction results and report the final F1 score. Don't forget to upload your prediction results to Kaggle. Note that this is a separate competition so that you can compare your scores with the numpy version.

7.5 Optional Part 4

Part 1 only used unigrams, which are single words. However, longer sequences of words, known as n-grams, can be very informative as features. For example, "not offensive" and "very offensive" are very informative features whose unigrams might not be informative for classification on their own. However, the downside to using longer n-grams is that we now have many more features. For example, if we have n words in our training data, we could have n^2 bigrams in the worst case; just 1000 words could quickly turn into 1,000,000 features, which will slow things down quite a bit. As a result, many people threshold bigrams based on frequency or other metrics to reduce the number of features. In this optional part, you'll experiment with adding bigrams (two words) and trigrams (three words) and measuring their impact on performance. This part is entirely optional and included for people who want to go a bit further into the feature engineering side of things.
• Count how many unique unigrams, bigrams, and trigrams there are in the training data and report each number.
• Are the bigram and trigram counts you observe close to the worst case in terms of how many we could observe? If not, why do you think this is the case? (Hint: are all words equally common? You might also check out Zipf's Law.)
• What percent of the unique bigrams and trigrams in the development data were also seen in the training data?
• Choose a minimum frequency threshold and try updating your solution to use these as features. We recommend creating a new method that wraps your tokenize method and returns a list of features (a small sketch of this wrapping follows the list).
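For instance, a wrapper along these lines (the function name extract_features is just an example) turns a token list into unigram, bigram, and trigram features:

    def extract_features(tokens, max_n=3):
        """Return unigram up to max_n-gram features for one document.
        N-grams are joined with underscores so they behave like ordinary words."""
        features = []
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                features.append("_".join(tokens[start:start + n]))
        return features

    # extract_features(["not", "good", "at", "all"], max_n=2)
    # -> ['not', 'good', 'at', 'all', 'not_good', 'good_at', 'at_all']

Frequency thresholding then works exactly as before, just over this larger feature set.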
8 Submission

Please upload the following to Canvas by the deadline:
1. a PDF (preferred) or .docx with your responses and plots for the questions above. Your submission needs to include your username on Kaggle.
2. your code for all parts. Code should be uploaded as separate files, not in a .zip or other archive. Code may be submitted as a stand-alone file (e.g., a .py file) or as a Jupyter notebook.
We reserve the right to run any code you submit. Code that does not run or produces substantially different outputs will receive a zero.

References

Arunesh Mathur, Angelina Wang, Carsten Schwemmer, Maia Hamin, Brandon M. Stewart, and Arvind Narayanan. 2020. Manipulative tactics are the norm in political emails: Evidence from 100K emails from the 2020 U.S. election cycle. https://electionemails2020.org.
We are going to apply the DCT transformation for image compression. A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. DCT helps separate the image into parts (or spectral sub-bands) of differing importance (with respect to the image's visual quality). DCT is similar to the discrete Fourier transform: it transforms a signal or image from the spatial domain to the frequency domain. For an image f, where f(x, y) represents the pixel value at (x, y), the equation of the DCT transformation is:

$$F(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \cos\!\left[\frac{(2x + 1)u\pi}{2M}\right] \cos\!\left[\frac{(2y + 1)v\pi}{2N}\right],$$

where u = 0, 1, ..., M − 1, v = 0, 1, ..., N − 1, and

$$\alpha(u) = \begin{cases} \sqrt{1/M} & u = 0 \\ \sqrt{2/M} & u \neq 0 \end{cases} \qquad \alpha(v) = \begin{cases} \sqrt{1/N} & v = 0 \\ \sqrt{2/N} & v \neq 0 \end{cases}$$

For most images, much of the signal energy lies at low frequencies; these appear in the upper left corner of the DCT. To recover the image, the equation of the inverse DCT (iDCT) transformation is:

$$f(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\, F(u, v) \cos\!\left[\frac{(2x + 1)u\pi}{2M}\right] \cos\!\left[\frac{(2y + 1)v\pi}{2N}\right],$$

where x = 0, 1, ..., M − 1, y = 0, 1, ..., N − 1, with the same α(u) and α(v) as above.

The whole process to compress an image with the DCT transformation takes the following steps: 1. use the DCT transformation to obtain the representation in the frequency domain; 2. compress the representation in the frequency domain by preserving the data in the top-left corner (1/4 of the area, i.e., 1/2 width and 1/2 height); 3. recover the image using the inverse DCT transformation (iDCT). Complete the function and some blanks in compression.py. Use the image lena.jpg as grayscale and plot your experiment results. Note that you can NOT use any functions relevant to DCT from extended packages like OpenCV.
• Visualize your results, including the representation in the frequency domain and the recovered image with/without compression in the frequency domain.
• Use another compression scale: preserve the data in the top-left corner of 1/16 of the area (1/4 width and 1/4 height). Compare the recovered images.
• Blockwise compression vs. global compression: divide the image into 8-by-8-pixel image patches. For each patch, perform the same operations (transformation, compression, recovering) and gather all the patches back into the whole image. Compare the results of blockwise compression and global compression (including the representation in the frequency domain).
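One way to picture the transform (not the required implementation) is to build the cosine basis matrices and apply them with matrix products; a sketch assuming a grayscale image stored as a 2-D numpy array:

    import numpy as np

    def dct_basis(M):
        """Rows are the 1-D DCT basis vectors with the alpha scaling above."""
        C = np.zeros((M, M))
        for u in range(M):
            alpha = np.sqrt(1.0 / M) if u == 0 else np.sqrt(2.0 / M)
            for x in range(M):
                C[u, x] = alpha * np.cos((2 * x + 1) * u * np.pi / (2 * M))
        return C

    def dct2(f):
        """2-D DCT: F = C_M f C_N^T, equivalent to the double sum above."""
        M, N = f.shape
        return dct_basis(M) @ f @ dct_basis(N).T

    def idct2(F):
        """2-D inverse DCT: f = C_M^T F C_N (the basis matrices are orthogonal)."""
        M, N = F.shape
        return dct_basis(M).T @ F @ dct_basis(N)

Compression then amounts to zeroing everything outside the top-left block of F before calling idct2.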
Submission format. Please submit your filled compression.py. Write up and analyze your experiment visualization results of the above tasks in the write-up.

Convolution is one of the most important operations for image processing. It is an effective way to extract desired local features from an image. Implementing a 2D convolution function will help you get familiar with the mechanism of convolution.

Requirements. Complete the myConv2D function in myConvolution.py. The input of myConv2D includes
• img – a grayscale input image.
• kernel – a 2D convolution kernel. You can assume that the kernel is always odd-sized.
• stride – the parameter to control the movement of the kernel over the input image. When the input height and width are not divisible by the stride, the remaining rows and columns will be ignored. The amount of stride can be defined by a two-element tuple, or a single integer, in which case it controls the movement amount for both the x-direction and the y-direction.
• padding – the parameter to control the amount of padding applied to the input image. The padding should be added to all four sides of the input image. You can call numpy.pad to pad the array. The amount of padding can be defined by a two-element tuple, in which case the first element controls the top and bottom padding size and the second one controls the left and right padding size; or a single element, which controls the padding size of all the sides. The details of the API can be found HERE.

The output of myConv2D is the processed image. The size of the output image can be calculated by

$$\text{output}[0] = \left\lfloor \frac{\text{img}[0] + 2 \times \text{padding}[0] - \text{kernel}[0] + \text{stride}[0]}{\text{stride}[0]} \right\rfloor \qquad \text{output}[1] = \left\lfloor \frac{\text{img}[1] + 2 \times \text{padding}[1] - \text{kernel}[1] + \text{stride}[1]}{\text{stride}[1]} \right\rfloor$$

Your code should not call convolve, correlate, fftconvolve, or any other similar functions from any libraries.

Hint. Here are some tips for a fast and elegant implementation.
1. Relying on mathematical functions that operate on vectors and matrices can speed up your calculation. You can find some examples of vectorized operations HERE.
2. The number of for loops should be as small as possible. Two loops may be sufficient for the implementation.
Submission format. Please submit your filled myConvolution.py.

Edge detection is a classical task in the field of computer vision. Canny's algorithm is one of the most popular solutions to this task because it can detect edges with a low error rate, locate the centerline of the edges accurately, and ignore most of the noise. The Canny process can be broken down into the following steps:
1. Filter the noise in the image with a Gaussian filter.
2. Find the intensity gradient of the image. A common choice is to apply the Sobel filter of the x-direction and the y-direction to do the convolution respectively.
3. Apply non-maximum suppression to filter out those pixels that have high gradient intensity but are not on an edge. Generate the edge candidates.
4. Determine the potential edges by the hysteresis threshold. The threshold is two-sided. When the pixel value is smaller than the lower threshold, it must not be part of an edge. When the pixel value is larger than the upper threshold, it must be part of an edge. When the pixel value is between the thresholds, decide whether the pixel is on an edge by its neighbors: if any of its neighbors is a member of an edge, this pixel is part of an edge.

Requirements. Complete the myCanny function in myEdgeDetector.py. The input of myCanny includes
• img – a grayscale input image.
• sigma – the standard deviation of the Gaussian kernel. In this implementation, it will also define the size of the Gaussian kernel as (2⌈3·sigma⌉ + 1) × (2⌈3·sigma⌉ + 1).
• threshold – the hysteresis threshold.
The output of myCanny is the detection result, whose size should be the same as the input image. Your code should not call an existing implementation of Canny or similar functions from any libraries. Submission format. Please submit your filled myEdgeDetector.py. Test your edge detector on the image hw3_zebra.jpg and report your experiment result in the write-up.

Part A Task. Apply the Hough Transform to an edge detection result of an image. Display the output in the write-up using img01.jpg and img02.jpg. Complete the function in myHoughTransform.py: [img_houghtrans, rhoList, thetaList] = Handwrite_HoughTransform(img_threshold, rhostep, thetastep). For the input of the function:
• img_threshold – the edge detection result, thresholded to ignore pixels with a low edge filter response.
• rhostep (scalar) – the distance resolution of the Hough transform accumulator in pixels.
• thetastep (scalar) – the angular resolution of the accumulator in radians.
Here we give suitable rhostep and thetastep values in houghScript.py. For the output of the function: img_houghtrans, rhoList, thetaList. img_houghtrans is the Hough transform accumulator that contains the number of "votes" for all the possible lines passing through the image. rhoList and thetaList are the 1-D arrays of ρ and θ values ranging from 0 to the maximum ρ/θ in steps of rhostep/thetastep. If rhoList[i] = ρ_i and thetaList[j] = θ_j, then img_houghtrans[i, j] contains the vote score for ρ = ρ_i and θ = θ_j. The main script to run this experiment is houghScript.py; note that you don't need to fill in any blanks there and you can use it directly as long as you implement the function tools. A sample visualization of img_houghtrans is shown in Figure 1.

Figure 1: A sample visualization of img_houghtrans.

Part B Task. Use your Hough transform and write a function in myHoughLines.py to detect lines: [rhos, thetas] = Handwrite_HoughLines(Im, num_lines), where Im is your accumulator from the Hough transform and num_lines is the number of lines to detect. The outputs rhos and thetas are both 1-D vectors of length num_lines that contain the row and column coordinates of the top num_lines peaks in the accumulator Im, that is, the lines found in the image.

Here is an important note, but do not spend your time implementing it yourself. For every cell in the accumulator corresponding to a real line (likely to be a locally maximal value), there will probably be several cells in the neighborhood that also scored high but should not be selected. This would cause a series of similar lines to be detected. These non-maximal neighbors can be removed using non-maximal suppression. We provide a non-maximal suppression function tool for you to use, or you can implement it yourself. Once you have ρ and θ for each line in an image, the main script will draw red lines representing detected edges in your image based on the values of ρ and θ. The green line segments represent the output of OpenCV's HoughLinesP (see Figure 2 for an example).

Submission Format. Please submit your filled myHoughTransform.py and myHoughLines.py. Write up and analyze your experiment results of the above tasks in the write-up.

Figure 2: A sample visualization of the line detection result.

The Scale-Invariant Feature Transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images. Implementing SIFT will help you get familiar with image descriptors. Requirements. Complete the getDoGImages and computeGradientAtCenterPixel functions in SIFT.py. Then see the SIFT function in action with the template matching demo given by matching_with_SIFT.py. Let's briefly go over SIFT and develop a high-level roadmap of the algorithm. You can (and should) read the original paper HERE. SIFT identifies keypoints that are distinctive across an image's width, height, and, most importantly, scale. By considering scale, we can identify keypoints that will remain stable (to an extent) even when the template of interest changes size, when the image quality becomes better or worse, or when the template undergoes changes in viewpoint or aspect ratio. Moreover, each keypoint has an associated orientation that makes SIFT features invariant to template rotations. Finally, SIFT will generate a descriptor for each keypoint, a 128-length vector that allows keypoints to be compared.
These descriptors are nothing more than histograms of gradients computed within the keypoint's neighborhood.

The SIFT pipeline includes the following functions:
• img – a grayscale input image;
• getBaseImage – appropriately blur and double the input image to produce the base image of our "image pyramid", a set of successively blurred and downsampled images that form our scale space;
• computeNumberOfOctaves – compute the number of layers ("octaves") in our image pyramid. Now we can actually build the image pyramid;
• getGaussianKernels – create a list of scales (Gaussian kernel sizes) that is passed to getGaussianImages;
• getGaussianImages – repeatedly blurs and downsamples the base image;
• getDoGImages – subtract adjacent pairs of Gaussian images to form a pyramid of difference-of-Gaussian ("DoG") images. You need to implement this function;
• findScaleSpaceExtrema – identify keypoints based on the DoG image pyramid. We iterate through each layer, taking three successive images at a time. Remember that all images in a layer have the same size; only their amounts of blur differ. In each triplet of images, we look for pixels in the middle image that are greater than or less than all of their 26 neighbors: 8 neighbors in the middle image, 9 neighbors in the image below, and 9 neighbors in the image above. The function isPixelAnExtremum performs this check. These are our maxima and minima (strictly speaking, they include saddle points, because we include pixels that are equal in value to all their neighbors).
When we've found an extremum, we localize its position at the subpixel level along all three dimensions (width, height, and scale) using localizeExtremumViaQuadraticFit, explained in more detail below. The code to localize a keypoint may look involved, but it's actually pretty straightforward. It implements verbatim the localization procedure described in the original SIFT paper. We fit a quadratic model to the input keypoint pixel and all 26 of its neighboring pixels. We update the keypoint's position with the subpixel-accurate extremum estimated from this model. We iterate at most 5 times, until the next update moves the keypoint less than 0.5 in any of the three directions, which means the quadratic model has converged to one pixel location. The two helper functions computeGradientAtCenterPixel and computeHessianAtCenterPixel implement second-order central finite difference approximations of the gradients and Hessians in all three dimensions. You need to implement computeGradientAtCenterPixel;
• removeDuplicateKeypoints – clean up these keypoints by removing duplicates;
• convertKeypointsToInputImageSize – convert the keypoints to the input image size;
• getDescriptors – generate descriptors for each keypoint. Now that we've found the keypoints and accurately localized them, we need to compute orientations for each of them. We'll likely also produce keypoints with identical positions but different orientations. Then we'll sort our finished keypoints and remove duplicates. Finally, we'll generate descriptor vectors for each of our keypoints, which allow keypoints to be identified and compared.

Now let's see SIFT in action with a template matching demo. Given a template image, the goal is to detect it in a scene image and compute the homography. We compute SIFT keypoints and descriptors on both the template image and the scene image, and perform an approximate nearest neighbors search on the two sets of descriptors to find similar keypoints.
Keypoint pairs closer than a threshold are considered good matches. Finally, we perform RANSAC on the keypoint matches to compute the best-fit homography. A sample result is shown in Figure 3.

Figure 3: Output plot of matching_with_SIFT.py, showing the detected template along with keypoint matches.

Submission Format. Please submit your filled SIFT.py. Write up and analyze your experiment results for the above tasks in the write-up.

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors D to represent an input vector x^(t) such that we can: i) find a sparse latent representation h^(t) (one with many zeros) that represents x^(t) as a linear combination of these basis vectors; and ii) reconstruct the original input x^(t) well. To achieve this goal, the objective function is given by

\min_{D} \frac{1}{T} \sum_{t=1}^{T} \min_{h^{(t)}} \frac{1}{2} \left\lVert x^{(t)} - D h^{(t)} \right\rVert_2^2 + \lambda \left\lVert h^{(t)} \right\rVert_1 \qquad (1)

where T is the number of samples, (1/2)‖x^(t) − Dh^(t)‖²₂ is the reconstruction error, λ‖h^(t)‖₁ is the sparsity penalty, and λ is the hyperparameter that controls the strength of the sparsity penalty.

Requirements. Implement image denoising with sparse coding. Display the output of sparsecoding.ipynb. We recommend learning the sparse codes based on the DCT dictionary you obtained in Q1. To solve this objective function, we provide a sample solver, MatchingPursuit, in persuits.py.

You could choose one of the following tasks:
1. Substitute this solver with the Adam optimizer in PyTorch and analyze the optimization process (a minimal sketch of this option is given below).
2. Analyze the effect of different choices of DCT dictionary on the image denoising task.

Submission Format. Please submit your code, and write up and analyze your experiment results for the above task in the write-up.
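If you choose option 1 above, the following is a minimal sketch (not the provided MatchingPursuit solver) of how the inner minimization over h^(t) in Equation (1) could be run with PyTorch's Adam optimizer for a single signal. The function name sparse_code, the hyperparameters, and the dimensions are illustrative assumptions and are not part of the provided starter code.

```python
import torch

def sparse_code(x, D, lam=0.1, steps=500, lr=0.05):
    """Minimize 0.5*||x - D h||_2^2 + lam*||h||_1 over h with Adam.

    x: (n,) signal/patch, D: (n, k) dictionary; returns the inferred code h.
    Illustrative sketch only -- names and hyperparameters are assumptions.
    """
    h = torch.zeros(D.shape[1], requires_grad=True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        residual = x - D @ h                              # reconstruction error term
        loss = 0.5 * residual.pow(2).sum() + lam * h.abs().sum()
        loss.backward()
        opt.step()
    return h.detach()

# Toy usage with a random dictionary and signal (replace with the DCT dictionary from Q1).
D = torch.randn(64, 256)   # 64-dimensional patches, 256 atoms
x = torch.randn(64)
h = sparse_code(x, D)
print(h.abs().gt(1e-3).sum().item(), "roughly non-zero coefficients")
```

Note that Adam does not produce exact zeros the way a proximal/ISTA step or Matching Pursuit would, so "sparsity" shows up as many coefficients shrinking toward zero; that difference is worth commenting on if you analyze the optimization process.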
Use Python to implement the following functionality. Given an array of integers, array, and an integer, goal, return the indices of the two numbers that add up to goal. We assume that each input pair of array and goal has exactly one solution, and you may not use the same element twice. Please return the answer in increasing order.

Example 1:
• Input: array = [2,2], goal = 4
• Output: [0,1]
• Explanation: Because array[0] + array[1] = 4, we return [0, 1].

Example 2:
• Input: array = [2,3,11,15], goal = 5
• Output: [0,1]
• Explanation: Because array[0] + array[1] = 5, we return [0, 1].

Submission Format. Please fill in the AddUp function and submit AddUp.py. (A minimal sketch of one possible approach is given at the end of this part.)

The aim is to familiarize you with the functionality of NumPy.
Installation. Please install NumPy and show your NumPy version in a screenshot.
Task. Please solve the 20 problems in numpy.py. Many need only a single line of code.
Submission Format. Please submit your filled numpy.py. Show the screenshot of your NumPy version in the write-up.

The aim is to familiarize you with the functionality of SciPy.
Installation. Please install SciPy and show your SciPy version in a screenshot.
Task. Please solve the 5 problems in scipy.py. Many need only a single line of code.
Submission Format. Please submit your filled scipy.py. Show the screenshot of your SciPy version in the write-up.

The aim is to familiarize you with the functionality of Matplotlib.
Installation. Please install Matplotlib and show your Matplotlib version in a screenshot.
Task. Please solve the 5 plotting tasks in matplotlib.py.
Submission Format. Please submit your filled matplotlib.py. Show the screenshot of your Matplotlib version and the results of functions w1–w5 in the write-up.

The aim is to familiarize you with the functionality of PyTorch.
Installation. Please install PyTorch and show your PyTorch version in a screenshot.
Task. Please follow the instructions, fill in the blanks in pytorch.py, and complete your first convolutional neural network for image classification.
• Read the demo code and fill in the blanks noted by """ xxx here """ in pytorch.py.
• Vary the number of training epochs over 1, 2, 4, 8, and 16, and plot the accuracy of the "plane" category as a function of the number of training epochs.
• Vary the learning rate over [10^−5, 10^−4, 10^−3, 10^−2, 10^−1, 1] and plot the accuracy of the "ship" category as a function of the learning rate.
• Let's try another loss function: Mean Squared Error (MSE) loss. Compare the performances.
Submission Format. Please submit your filled pytorch.py. Show the screenshot of your PyTorch version in the write-up. Write up and analyze your experiment results for the above tasks in the write-up.
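Returning to the AddUp warm-up at the start of this part: one common approach is a single pass with a hash map of previously seen values. The sketch below only illustrates that idea under the stated assumption that exactly one solution exists; check it against the exact signature required in the provided AddUp.py, which may differ.

```python
def AddUp(array, goal):
    """Return the two indices (in increasing order) whose values sum to goal.

    Assumes exactly one solution exists and that an element cannot be reused,
    as stated in the task description.
    """
    seen = {}  # value -> index where that value was first seen
    for i, value in enumerate(array):
        complement = goal - value
        if complement in seen:
            return sorted([seen[complement], i])
        seen[value] = i
    return []  # unreachable if the one-solution guarantee holds


# Matches the examples above:
print(AddUp([2, 2], 4))          # [0, 1]
print(AddUp([2, 3, 11, 15], 5))  # [0, 1]
```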
The goal of this project: Penetration testing is an important part of ensuring the security of a system. This project introduces some of the common tools used in penetration testing, while also exploring common vulnerabilities (such as Shellshock and setUID bit exploits).

On September 24, 2014, a severe vulnerability in Bash, nicknamed Shellshock, was identified. This vulnerability can be used to exploit many systems and can be launched either remotely or from a local machine. In this project, you will gain a better understanding of the Shellshock vulnerability by exploiting it to attack a machine. The learning objective of this project is to get first-hand experience with this interesting attack, understand how it works, and think about the lessons that we can take away from it.

Environment Setup:

Files Provided                   Description
shellshock_server.ova            The VM image.
assignment_questionnaire.txt     Template for final submission.

This project requires the use of VirtualBox and multiple VMs.

Installing VirtualBox:
1. Install VirtualBox if it is not already installed. This project requires the latest version of VirtualBox. Using an earlier version of VirtualBox has been known to cause errors where the project VM freezes, so if you run into this, ensure you are running at least VirtualBox 6.1.0 or later.
2. Download the Oracle VirtualBox Extension Pack (available for download at the same location as VirtualBox).
3. Import the Extension Pack under File -> Tools -> Extension Pack Manager.
4. In VirtualBox, go to File -> Tools -> Network Manager.
5. Select the NAT Networks tab and click on Create. Right-click on the newly created NAT network, click on Properties, and rename it to something related to the project. Then save it and close the preferences.

Installing the Kali Linux VM (feel free to use any other OS, but we recommend using Kali):
6. Download the 64-bit VirtualBox version of the Kali VM from the link: kali download.
7. Extract the above zip and double-click on the vbox file. Once you have imported the VM, you'll need to go into the VM Settings and increase the number of CPUs if possible, as well as the RAM. Also enable 3D acceleration and set the zoom level to 300 (it makes it easier to read).
8. Go to the new Kali VM's settings. In the Network tab, change "Attached to:" from NAT to "NAT Network" in the Adapter 1 tab. The name of the network should autofill to your newly created network if you only have one, but if you have multiple NAT networks, you'll need to select the correct one.
9. Start the Kali VM. The default username/password is kali/kali.
10. Repeat the process with the other VM (shellshock_server.ova), which you will download in a later step, so that these two VMs can communicate with each other.

Vulnerable machine:
11. Download the project appliance using the link below:
1. shellshock_server.ova
sha256sum: b8729307ad6849d17c6b88e8a5893d5f7a7ccf3167d9279e4067039379be1703
12. Double-click the downloaded ova file to start its import process.
13. Once imported, adjust the CPU and RAM for this VM like you did for the Kali VM. Ensure Adapter 1 is set to NAT Network and the name of the NAT Network is the name you specified when creating the NAT Network in step 5.
14. Boot up the shellshock_server and you're good to go.

NOTE: You do not need to log into the shellshock VM. Once you see the login screen, you're good to go. Leave the shellshock VM running and switch to the Kali VM for the rest of the project now.

15. Now try to connect to the shellshock VM server from Kali.
Once you find the IP of the VM in Task 1 below, navigate to the following URL:
http://<VM IP>:<port>/cgi-bin/shellshock.cgi
Then you should be able to see the content.

16. Notice that you do not need a password to access the web content hosted on the VM server. Instead of using a real server, it is safer to perform the attack on an emulated Apache HTTP Server VM. To be clear, you will not be logging into the shellshock VM during this project. Once you configure the shellshock VM (by following the setup steps above), you will be attacking it externally, from the Kali VM, by exploiting the Shellshock vulnerability.

Project Tasks (100 points):

Task 1: Network Scanning – (10 points)
The first step in any penetration test is to gather information about the network and servers you'll be exploiting. In this task, you will perform network scanning and answer a few questions based on your findings. On startup, the shellshock VM listens on several ports for incoming messages.
1. Find the IP address of the vulnerable VM on the NAT network using nmap.
2. Use nmap to identify all the open ports on the shellshock server and submit the port number which handles the HTTP traffic to the Apache web server on the VM.
Note: If possible, please use VirtualBox for this part, since in previous semesters VMware has been known to cause some issues with the network setup and Task 1.

Task 2: Attack CGI Program – (20 points)
In this task, you will launch the Shellshock attack on a remote web server. Many web servers enable CGI, which is a standard method used to generate dynamic content on web pages and web applications. Many CGI programs are written as shell scripts. Therefore, before a CGI program is executed, the shell program will be invoked first, and such an invocation can be triggered by a user from a remote computer.

To access this CGI program from the web, you first need to check that the server VM is running. Then, you can either use a browser by typing the following URL:
http://<VM IP>:<port>/cgi-bin/shellshock.cgi
or use the command line program curl to do the same thing:
$ curl http://<VM IP>:<port>/cgi-bin/shellshock.cgi

For this task, your goal is to launch the attack through this URL, such that you can achieve something that you cannot do as a remote user. For example, you can execute some file on the server, or look up some file located on the server. When you successfully launch the attack, please execute the /bin/task2 program (which needs your GT username as the input) inside the VM. It will generate the submission hash for you.

For students who want to verify their work, here's an example correct input/output for /bin/task2:
$ /bin/task2 gburdell3
Here is your task2 hash: 5732307cf1ef49dd2613e9bbe28dd3e0ea907c28d6c8736faf0c9537c2b20c5a

Task 3: Reverse Shell with Metasploit – (25 points)
Now you have successfully launched the Shellshock attack, and you can execute commands on the server VM. However, during a penetration test, you likely won't have time to craft a payload for every exploit you use. And what if the server was not in fact vulnerable to Shellshock? How would you know whether your exploit failed because it was wrong, or because there was no vulnerability?

That is where Metasploit comes in. Metasploit is a standard in the penetration testing community. It allows pen testers to run precompiled exploits against a host, using predefined payloads.
For this project, we're going to use Metasploit to establish a reverse shell between your machine and the host machine.

Let us first see how Metasploit works by using it to scan the open TCP ports of the shellshock_server VM. While this is a task better suited to tools like nmap (as seen in Task 1), it serves as a good demonstration of how to use a Metasploit module. If you have used Metasploit before, you can skip this introduction and go directly to the Task 3 assignments section at the end.

1. Begin by opening a new terminal on your Kali VM. In the new terminal, type: msfconsole. After a moment, the Metasploit Framework console (msfconsole) will load. For this project, the msfconsole is the main way of accessing Metasploit. While there are other tools and command prompts associated with Metasploit, the msfconsole is suitable for the entirety of this project.

2. For this example, we're going to scan the ports of a host. You can use the results from Task 1 to ensure Metasploit is behaving properly. In practice, using a Metasploit module solely for the purpose of scanning ports on a host is a little overcomplicated, since nmap can do much more and takes less setup, but this does offer a good introduction to using a Metasploit module.

3. A Metasploit module is the base of any task performed in Metasploit. It consists of Ruby code that is written to perform a certain task (like exploiting a certain vulnerability or scanning a certain kind of system). There are multiple different varieties of modules, but for our purposes we'll focus on three of them:
a. Exploit modules: These are modules written to exploit a certain vulnerability.
b. Auxiliary modules: These are modules written to perform some task related to exploiting a system (like scanning). Within the auxiliary modules there are many kinds of modules. We'll be using the "scanner" modules for this example.
c. Payloads: Calling these modules is a little misleading. They are the payloads (for example, the shellcode) that are sent within the exploit.

The reason there are so many results when you search for "scanner" is that "scanner" is a class of auxiliary modules (as seen by there being so many modules beginning with "auxiliary/scanner"). So, searching for "scanner" (or "scan") will return everything at all related to network or vulnerability scanning.

After refining the search, that seems a little better: only 7 results. Since we're scanning for open TCP ports, let's use "auxiliary/scanner/portscan/tcp". To use a Metasploit module, you should run the command: use module_name

You should now see "auxiliary(scanner/portscan/tcp)" after "msf5", indicating you are using the auxiliary(scanner/portscan/tcp) module.

6. Now that we're using the correct module, we must set the correct options. Try running the command options. You should see the module's options table.

While each of the options is marked as required, most of the pre-filled values work for our purposes. The only one we need to change is RHOSTS. RHOSTS should be the IP address of the host we want to scan. In this case, you should set it to the IP address of the shellshock_server VM that you found in Task 1. You can use the command: set rhosts IP_of_host. Now try running options again and ensure the RHOSTS option is set to the right IP address.

Now type run and you should see the same list of open ports that nmap showed. While "run" is usually used to run auxiliary modules, the commands "check" and "exploit" are often used to check and run exploit modules.

7. Now you've seen an example of how to use Metasploit.
You'll follow a similar process when exploiting the Shellshock vulnerability.

Task 3 Assignments:
1. Find an exploit module that exploits the Shellshock vulnerability on an Apache web server. Once you've found the module, place the module name in assignment_questionnaire.txt.
2. Use show payloads to show the possible payloads for the module. Find a payload that spawns a reverse TCP shell. Place the full name of the payload in assignment_questionnaire.txt.
3. Run the exploit and spawn a reverse shell on the VM.
4. Run /bin/task3 in the resulting shell, then type cs6262 and then your user ID. Report the hash value for your user ID in assignment_questionnaire.txt.

You'll submit all your answers for this section in assignment_questionnaire.txt. You should keep the reverse shell running after finishing Task 3, as you will need it in Task 4.

Task 4: Privilege Escalation – (20 points)
Your goal: You aim to upgrade the privilege of your command shell by exploiting a setUID vulnerability. You will run /bin/task4 as the higher-privileged user "shellshock_server", not the default user "www-data".

Background: In Unix-based systems, setUID is an access-rights flag that lets a program run with the privileges of its owner rather than of the user who launches it. For instance, when users want to change their password, they may run passwd, which requires root privileges. The setUID bit helps the user run passwd with temporarily elevated privileges. However, if setUID is misused or setUID flags are misconfigured, it can cause a variety of vulnerabilities, such as information leakage, unwanted privilege escalation, etc.

Assignments: As a first step, type whoami in your shell. You should see "www-data", which is your user ID. Now, run /bin/task4 gt_username. You will see a permission denied error. That is because /bin/task4 is configured to allow only the "shellshock_server" user to run it. So, you need to find a way to run task4 as the "shellshock_server" user. A feasible approach is to spawn a shell running as the "shellshock_server" user and run task4 through it.

Your goal is to find a program which:
1. Has a higher privilege than the default user.
2. Can spawn a shell.

Useful Resource: https://gtfobins.github.io/

You may want to ransack /usr/bin for a program which has a higher privilege than the default user, and then run /bin/task4 gt_username in the shell it spawns.

What is the vulnerable program? What command did you use to find it? What command did you use to spawn a shell with the vulnerable program? And what is the hash value from /bin/task4 gt_username (like /bin/task2)? Please leave your answers in assignment_questionnaire.txt.

NOTE: If you exploit a symlinked binary, submit the name of the target binary that is linked. E.g., if binary A points to binary B, submit binary B as the answer.

Task 5: Password Cracking – (25 points)
An invaluable part of any penetration test is password cracking. While there may be no known vulnerabilities in a system, a weak password can be just as damaging in allowing an attacker to gain access to a system (or view sensitive information once they gain access). We're going to look at two kinds of weak passwords in this task: passwords that are too short, and passwords that can easily be guessed via password scraping.

To begin, start a Meterpreter shell (using a Meterpreter shell payload) through the Metasploit Shellshock module from Task 3. A Meterpreter shell is different from the reverse TCP shell in Task 3, as it allows you to run Metasploit-specific commands on the vulnerable machine (like download).
Navigate to /home/shellshock_server/secret_files/. There are two encrypted files here: task51.zip is encrypted with zip, while task52.pyc.gpg is encrypted with gpg (a common file encryption tool in Linux). Download these two files (task51.zip and task52.pyc.gpg) to your Kali VM using the Meterpreter.

We already know the developers of this web server are not very security savvy, since they let a Shellshock vulnerability plus a setUID exploit give a high-privilege shell on their machine. So, chances are they did not pick very secure passwords for these secret files. Your goal in this task is to crack the passwords of these two files using John the Ripper (a popular password cracker) and cewl (a password scraper).

The command line tools used in Task 5 are in /usr/sbin on the Kali VM. To run them, you can either add /usr/sbin to the $PATH variable or write /usr/sbin/ before each command.

You should use zip2john and gpg2john to extract the password hashes from the encrypted files. For task51.zip, try running John the Ripper incrementally. Report your John the Ripper command in assignment_questionnaire.txt (whether you also report your zip2john and gpg2john commands is up to you, but they will not be graded).

For task52.pyc.gpg, try running John the Ripper incrementally again. Hmm… it seems to run forever. That is because John the Ripper is trying every combination of characters. If the password is too long (among other things), John the Ripper could run for years before it finds it.

Just because the password is too long to be found incrementally does not mean it cannot be cracked. Take a look at the shellshock.cgi page in the browser. It looks like it gives a link to a profile of the authors. If the authors are not great at picking secure passwords, maybe the password is something about them that we can guess from their profile page.

But even if the password is on the profile page, it can still take a while to guess by hand. What if the password was kItt3n$ or deVEL0p3r? It would be hard to guess that, even if the word it was based on (like "kittens" or "developer") was on the profile page.

Instead, let us use a password scraper to create a custom wordlist for John the Ripper. For this project, we suggest using cewl. cewl is a simple command line program that comes preinstalled on Kali and creates password wordlists from webpages. Since we want to get every possible password (including passwords the authors based on something on shellshock.cgi, or on one of the landing pages), you should run cewl on shellshock.cgi with the proper settings to ensure it follows the links all the way to profile.cgi (you may need to tune the cewl parameters). Report your cewl command in assignment_questionnaire.txt. Then try running John the Ripper on task52.pyc.gpg with the wordlist (and the default wordlist rules for testing different permutations of the words). Report your John the Ripper command in assignment_questionnaire.txt.

Once you find the passwords, report them in assignment_questionnaire.txt. Then decrypt the two files (using unzip for task51.zip and gpg for task52.pyc.gpg) and run python2.7 task51.pyc gt_username and python2.7 task52.pyc gt_username. Report the resulting hash values for your user ID in assignment_questionnaire.txt.

Deliverables:
Please use Gradescope to submit the assignment files. The link to Gradescope is found in Canvas under the Courses tab -> Gradescope. In Gradescope, under active assignments, click Project 1 to upload your files.
● assignment_questionnaire.txt
● README.txt

The provided template "assignment_questionnaire.txt" marks where you should add your answers for each question. Please DO NOT edit anything other than the fields marked for your answers (or student ID) in assignment_questionnaire.txt. Doing so may result in the autograder failing to process your submission. Please note that there is a 5-point penalty for not following the format given in assignment_questionnaire.txt.

You should also include a "README" plain text file explaining how and why each piece of your solution works. Including it makes grading easier if we cannot reproduce your results. You may also include screenshots in your submission as proof. To include screenshots (optional), upload them to your GT SharePoint drive and add a link in the README.txt file. If you use any outside sources not mentioned in this write-up, the README would be a good place to mention them as well. The README.txt file is a plain text file; no other formats are accepted.

FAQ:
● For this project, students have unlimited submissions on Gradescope.
● You do not need to log into the shellshock image at any point to complete this project.
● For all task binaries, make sure to pass in your GT username as the argument (e.g., gburdell3) and beware of extra spaces after your name, because they will result in an incorrect hash.
● Make sure you are using at least VirtualBox 6.1.x (students have completed the project on other versions too, but it sometimes causes issues).
● Make sure to assign at least 2 CPUs and 4 GB of RAM to your shellshock image for it to function properly.
● The objective of this project is to get you familiarized with the absolute basics of penetration testing, and therefore we encourage Googling to learn more about the vulnerability, the tools, and the attack concepts.

Reminders:
Please submit the files EXACTLY as requested to Canvas. DO NOT package them up (e.g., as a ZIP file). Any deviations may result in a deduction of your grade. There is a 5-point penalty for not following the submission format shown.

Useful Links and References:
● Shellshock Vulnerability
◦ https://github.com/carter-yagemann/ShellShock
◦ https://en.wikipedia.org/wiki/Shellshock_(software_bug)
◦ http://seclists.org/oss-sec/2014/q3/650
● curl
◦ https://curl.haxx.se/docs/manpage.html
◦ https://curl.haxx.se/download.html (curl.exe for Windows)
● netcat
◦ https://linux.die.net/man/1/nc
◦ https://eternallybored.org/misc/netcat/ (nc.exe for Windows)
● nmap
◦ https://nmap.org/book/man.html
◦ Service and Version Detection | Nmap Network Scanning
● Metasploit
◦ https://www.offensive-security.com/metasploit-unleashed/metasploit-fundamentals/
● John the Ripper
◦ https://www.openwall.com/john/doc/
● cewl
◦ https://tools.kali.org/password-attacks/cewl

Acknowledgment:
This lab was modified from SEED Labs, Copyright 2014 Wenliang Du, under the terms of the GNU Free Documentation License, Version 1.2. The original document can be found at http://www.cis.syr.edu/~wedu/seed/

Checklist/Rubric:

Section                                                                   Points
1    Network Exploration                                                  10
1.1  Correct first digits of the IP address of the vulnerable VM.         5
1.2  Correct HTTP port.                                                   5
2    Exploiting the System                                                20
2.1  Command used to exploit the Shellshock vulnerability.                5
2.2  Correct hash value from running /bin/task2.                          15
3    Spawning a Shell with Metasploit                                     25
3.1  Correct exploit module.                                              5
3.2  Correct payload module (there are multiple correct answers here).    5
3.3  Correct hash value from running /bin/task3.                          15
4    Privilege Escalation                                                 20
4.2  Correct vulnerable program name.                                     10
4.4  Correct hash value from running /bin/task4.                          10
5    Password Cracking                                                    25
5.2  Correct password for task51.zip.                                     5
5.3  Correct hash value from running python task51.pyc.                   7.5
5.6  Correct password for task52.pyc.gpg.                                 5
5.7  Correct hash value from running python task52.pyc.                   7.5
–    Possible Deductions
     i. Incorrect assignment_questionnaire.txt format.                    -5
     ii. Incorrect upload to Gradescope/Canvas.                           -5
Total: 100

Your Submission Includes:
assignment_questionnaire.txt – Answers to questions
README.txt
(10 points total) Your band is finally earning some money! Not a lot, but you feel it is time to automate the computation of the salaries of the band members. You started to implement a static method to compute the salary (see Salary.java), but the other band members complained that it is very buggy. One even implied that it was not tested at all (how dare they).

In this assignment, you need to improve the implementation of the pay() method and, given the importance of the correctness of this method, provide a test suite for the method to make sure that you are not incorrectly paying the band members.

The method has three parameters:
• salary: The base salary earned by a band member during a salary period (in dollars).
• snacksAmount: The total amount spent by a band member on snacks during tours and shows during a salary period (in dollars).
• bonus: The bonus percentage that the member earned this salary period (in percent).

The pay is computed by deducting the amount spent on snacks from the base salary, and then computing and adding the bonus (if any) over that amount; that is, pay = (salary − snacksAmount) × (1 + bonus/100). So, if a band member earned 100 dollars, spent 50 on snacks, and earned a bonus of 10 (%), their pay will be 55 dollars.

The pay parameters have the following properties (aside from 'common sense' properties, which you should figure out yourself):
• The base salary can be at most $1,000.
• The bonus can be between 0 and 10 (%).
• A band member cannot spend more on snacks than their base salary.
• Because of a strange compatibility issue with your accountant's software, the salary and snacksAmount are both Double objects, and the bonus is specified as an Integer object. The pay method must return a Double object. You cannot change this, as your accountant will get mad.

For this assignment, you need to:
1) Improve the pay() method given the above requirements.
2) Create a test plan to test the pay() method. The test plan consists of a table with the parameter values and the expected output of the method. Also, please indicate which test method covers each case.
3) Implement your test plan as the JUnit test class TestSalary. Make sure to give your test cases descriptive names, to make it easier to debug your code if a test fails. Please do not modify any of the names/methods we've defined in the provided *.java files. You will get 0 points for this assignment if you do.

Hints:
• Because you're doing computations with Doubles (i.e., you're doing floating point arithmetic), there will be some precision errors. JUnit allows you to specify a delta to verify whether the result is within a certain error range (https://junit.org/junit5/docs/5.0.1/api/org/junit/jupiter/api/Assertions.html#assertEqualsdouble-double-double-). For this assignment, you can set the delta to 0.001.
• For invalid inputs, your pay() method should throw an IllegalArgumentException. Make sure to include a message that explains why the exception is thrown (you can decide exactly which message).
• You are not allowed to change the signature of the pay() method. If you do, you will get 0 points for this assignment.

Rubric (20 points total)
(16 points) Correct implementation of the pay() method. Your pay() method will be tested using our extensive JUnit test suite.
(4 points) Test plan.
If you do not submit a test plan and/or test suite, or they do not make sense or do not match your implementation, we can deduct more points.

Please submit:
1) A zip file containing your improved Salary class, your test suite, and a PDF with your test plan.
Name the file 'FirstName_ID_asg5.zip' and keep the exact same file structure as the zip that was provided for the assignment. You can store the test file in the same location as the Salary class. For example:

Filename: Cor-Paul_1234567_asg5.zip
|——- testplan.pdf
|——- src
| |——- ece325
| | |——- lab
| | | |——- assignment5
| | | | |——- Salary.java
| | | | |——- TestSalary.java

2) A screencast/movie that shows the following steps:
• Open your eClass with your name shown
• Open your IDE
• Show your code briefly
• Execute your test suite and show that all tests pass.

Please do not modify any of the names/methods we've defined in the provided *.java files.
With a better understanding of the cool things you can do with Collections in Java, you decide to once more improve the implementation of your SongCollection (actually, you decide to write it from scratch again, so no need to look at the old code). Hopefully, this is the last time, but as you probably know, there are no guarantees in the life of a neoclassical jazzhopmetal artist.

– You want to use this SongCollection to easily filter duplicates in your input data. Duplicate songs (i.e., those with the same title) should end up only once in the SongCollection.
– Also, you want to use this SongCollection to easily sort your songs in the following ways:
– By their natural ordering: ordered by title from A-Z
– By an alternative ordering: ordered by average rating from high to low (and then, for songs with equal ratings, by number of votes from high to low)

Based on these requirements, you decide to use a TreeSet instead of an ArrayList as the internal data structure for the SongCollection. In this assignment, you will implement the application.

1) Let Song implement the correct interface for usage in a TreeSet and finish the implementation for the sorting. The natural ordering of songs is by their titles (from A-Z).
2) Implement the loadSongs() method in the SongCollection class. You can load the songs directly into the collection. Your implementation must use a BufferedReader and a Scanner. The songs and their ratings are in the songratings.txt file (one song per line) in the following format:
Title;rating;votes
So, in comparison to the previous assignment, there is no need to compute the average rating – you can load it directly from the file. Make sure that your program doesn't break when there are malformed inputs in the input file. For this assignment, it is OK to just silently ignore any malformed inputs – your program should not cause any unhandled exceptions.
3) Sometimes, we want to see the songs ordered by their average rating (of course, with the highest rated song first). For two songs with the same rating but a different number of votes, the song with the higher number of votes should be ranked highest. Finish the implementation of the sort() method (and add any class(es) you may need).
4) In the future, you may need to store the songs in a HashSet, so you have to prepare the implementation for that as well (= add the right methods; see the lecture slides for which ones). Implement the methods that are required to use your Song class in a HashSet.
5) Let the main() method (in SongCollection) of your program do the following:
– Create a new SongCollection
– Load the songratings.txt file
– Print the songs in the song collection (using their natural ordering, from A-Z)
– Print the songs in the song collection (ordered by average rating as described in (3))
– Demonstrate that you can use your Song class in a HashSet (you can just call demonstrateHashSetUsage() for this – you don't have to change this method, but make sure that the output is correct!).

Hints:
– Make sure to implement your comparison code in the right classes, and to reuse as much functionality as possible from existing classes.
– You can assume that there are never records in the txt file that have the same title but different ratings/votes (there can be completely duplicate records though, or records that do not have the right format).

Rubric (20 points total)
Please submit:
1) A zip file containing your code and a PDF with the answers to the questions above.
Name the file 'FirstName_ID_lab_asg4.zip' and keep the exact same file structure as the zip that was provided for the assignment. So, for example:

Filename: Cor-Paul_1234567_lab_asg4.zip
|——- songratings.txt
|——- src
| |——- ece325
| | |——- labs
| | | |——- lab4
| | | | |——- *.java

2) A screencast/movie that shows the following steps:
• Open your eClass with your name shown
• Open your IDE
• Show your code briefly
• Execute your code and demonstrate that your program works correctly (= show the execution of the main() method).

Please submit the screencast as a separate file to eClass. Please do not modify any of the names/methods we've defined in the provided *.java files.
Now that you have played some gigs, you can use audience feedback to help you decide which song and instrument combinations are great and which ones aren't. For every song you play with a certain set of instruments, you recorded the following in a text file:
– The song title
– The list of instruments used to play it
– The audience rating (between 0 and 10, with 10 being the best)
Because audiences differ, you have multiple ratings for some songs. The SongCollection will hold Song objects (that have a title, a list of instruments used to play it, and a rating).

Finish the implementation to load the SongCollection (including the average ratings) from the songratings.txt file in the provided classes. A Song should occur only once in the SongCollection – make sure to update the average rating when you encounter multiple ratings for the same song.

So, for example:
Contribution;Guitar,Guitar,Drums;8
Contribution;Guitar,Guitar,Drums;10
should be stored in the SongCollection as a Song with title "Contribution", instruments "Guitar", "Guitar" and "Drums", and an average rating of 9.

Hints:
– SongCollection uses an ArrayList internally to hold the Song objects.
– You will need to override the equals() method in the Song class. For now, you can assume that two Song objects are equal if their titles and lists of instruments are the same.
– To make your life easier, you can treat the list of instruments as an ArrayList of String objects. So, e.g., "Guitar,Guitar,Mic" should result in an ArrayList with the Strings "Guitar", "Guitar" and "Mic". The order of the instruments has no meaning, so "Guitar,Guitar,Mic" should be considered the same as "Guitar,Mic,Guitar" by your application. Note that a song can be played with different combinations of instruments!
– The equals method(s) in the Song class are quite important for the correct functioning of the application. Make sure that they do not crash on unexpected input.
– The Collections class in the Java API contains some functionality that can be helpful for your application.

Please answer these questions about the implementation/design choices. There is no need to change your implementation based on these questions; a textual explanation is fine.
1) The Song class contains two equals() methods. Explain the difference using your knowledge about inheritance.
2) Why do we need the version of equals() that takes an Object as the parameter in this class?

Rubric
Implementation of the application: 16 points total. Questions: 2 points each. (20 points total)

Please submit:
1) A zip file containing your code and a PDF with the answers to the questions above. Name the file 'FirstName_ID_lab_asg3.zip' and keep the exact same file structure as the zip that was provided for the assignment. So, for example:

Filename: Cor-Paul_1234567_lab_asg3.zip
|——- solution.pdf
|——- src
| |——- ece325
| | |——- labs
| | | |——- lab3
| | | | |——- *.java

2) A screencast/movie that shows the following steps:
• Open your eClass with your name shown
• Open your IDE
• Show your code briefly
• Execute your code and demonstrate that your class works correctly.
Please submit the screencast as a separate file to eClass. Please do not modify any of the names/methods we've defined in the provided *.java files.
Your neoclassical jazzhopmetal band is becoming more popular, which is great! But this popularity comes at a cost: you get invited to play lots of shows. To make sure that you don't forget any of your equipment after a show, you decide to create an application that helps you keep track of your equipment. In this assignment, you will implement the application and answer some questions about the design.

1) The Equipment classes: The following types of equipment exist: instruments (guitars and keyboards) and furniture (stools and chairs). You want to be able to distinguish between these types (e.g., because you only have to bring instruments to some concert venues that already have the furniture you need). Design these classes so that it is easy to add new types of Equipment later on. To make keeping track of the number of pieces of equipment in the inventory easier, make sure to add a toString() method that returns the type of the equipment (e.g., "Guitar" or "Equipment"). Implement the Equipment class and add the other classes that you need.

2) The EquipmentInventory class: The EquipmentInventory consists of two components:
– An ArrayList to keep track of all your Equipment,
– A HashMap to keep track of the number of every type of equipment that you have (so that you can easily display these numbers).
Check the Java docs to see how to use a HashMap – it is basically a dictionary of key-value pairs. In this case, the key is a String and the value is an Integer. In your application, you decide to use the toString() method of a piece of Equipment as the key. So, an example key-value pair could be ("Guitar", 3). Implement the EquipmentInventory class. Make sure to follow the specification in the comments in the provided EquipmentInventory class.
Hints:
– Make sure you cannot accidentally add the same equipment twice.
– Make sure to update the HashMap after adding/removing a piece of equipment.

3) Do a test run of your application: You own the following equipment (so add these to your EquipmentInventory in your application's main() method): 3 guitars, 2 keyboards, 3 stools, 1 chair. After adding the equipment, print the contents of the inventory and verify that the numbers are correct. Your output should look as follows:
[EquipmentInventory: Guitar: 3, Stool: 3, Chair: 1, Keyboard: 2]
(the order of the equipment types may differ, as long as the numbers and the String formatting are the same).

4) Show that you can also remove equipment from the inventory: You end up selling one keyboard and one stool. Update your inventory and show it again.

Please answer these questions about the implementation/design choices. There is no need to change your implementation based on these questions; a textual explanation is fine.*
1) Why do we choose to implement add(Equipment e) instead of a more specific method like add(Guitar g)?
2) Why do we use a String instead of an Equipment object in the HashMap?
3) Why do we use an Integer instead of an int in the HashMap?
4) Why do we need to define a custom toString method for the Equipment types? What happens if we use the default method?
5) What happens if we only define a custom toString method for the Equipment or Instrument class (but not for any of the subclasses)?

Rubric
Implementation of the application: 15 points total. Questions: 1 point each. (20 points total)

Please submit:
1) A zip file containing your code and a PDF with the answers to the questions above.
Name the file 'FirstName_ID_lab_asg2.zip' and keep the exact same file structure as the zip that was provided for the assignment. So, for example:

Filename: Cor-Paul_1234567_lab_asg2.zip
|——- solution.pdf
|——- src
| |——- ece325
| | |——- labs
| | | |——- lab2
| | | | |——- *.java

2) A screencast/movie that shows the following steps:
• Open your eClass with your name shown
• Open your IDE
• Show your code briefly
• Execute your code and demonstrate that your class works correctly.
Please submit the screencast as a separate file to eClass. Please do not modify any of the names/methods we've defined in the provided *.java files.
(20 points total) You and a few friends decide to start a band. You already agreed on the music style (neoclassical jazzhopmetal, it's pretty experimental), but you have yet to decide on the most important thing: a band name. Since the band name is very important, and to exploit your newly-acquired knowledge of Java, you proposed to create a band name generator in Java.

You agree that the name must consist of two adjectives followed by a noun, which is a common combination for neoclassical jazzhopmetal band names. So, an example could be 'Charming Chubby Cat'.

(15 points) To implement the band name generator, you have to complete the BandNameGenerator.java file (see the file on eClass). Make sure that it compiles and runs without errors and warnings.

Some notes/hints:
– You are free to add fields/methods as needed. Make sure that you do not change the signatures of the methods that are already defined – we will deduct points if you do so.
– Use the provided nouns.txt and adjectives.txt files for a large list of nouns and adjectives.
– Because we have not discussed reading/writing files yet, a method loadTxt() is provided that loads a txt file into an array of String objects. The txt file is assumed to contain the total number of rows on the first line, and one word per line in the rest of the file. You are not allowed to change this method, and you can use it in the following assignments as well for loading text files.
– You can create a random number by calling Math.random(). See the docs for more information (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Math.html).
– Print 20 randomly generated names (no more, no less).
– Every word in your band name should start with an uppercase letter and contain only lowercase letters after that. So, e.g., 'CHARMING chubby Cat' would become 'Charming Chubby Cat'.

(5 points total) Please answer these questions about the implementation/design choices. There is no need to change your implementation based on these questions; a textual explanation is fine.*
(3 points) 1. Why do the adjectives and nouns files contain the total number of words on the first line? What would happen if we tried to load a file that does not contain this number? What would need to change in the code to avoid this problem?
(2 points) 2. As you may have noticed, it is quite easy to generate a band name that can be considered inappropriate. Since you would like to tell your grandparents about your band too, you prefer an appropriate name. How would you go about making sure that the generated name is appropriate? What are some of the challenges that come with doing so?

* Of course, if you want to implement them, please go ahead! It is great practice.
** Thanks to GitHub user hugsy for providing the adjectives and nouns files.

Please submit:
1) A zip file containing your code and a PDF with the answers to the questions above. Name the file 'FirstName_ID_lab_asg1.zip' and keep the exact same file structure as the zip that was provided for the assignment. So, for example:

Filename: Cor-Paul_1234567_lab_asg1.zip
|——- solution.pdf
|——- adjectives.txt
|——- nouns.txt
|——- src
| |——- ece325
| | |——- labs
| | | |——- lab1
| | | | |——- *.java

2) A screencast/movie that shows the following steps:
• Open your eClass with your name shown
• Open your IDE
• Show your code briefly
• Execute your code and demonstrate that your class works correctly.
Please submit the screencast as a separate file to eClass. No screencast = -5 points.
Please do not modify any of the names/methods we’ve defined in the provided *.java files.
In this laboratory exercise, you will delve into the implementation of combinational logic circuits, leveraging the Xilinx Zybo Z7 FPGA development board and the Xilinx Vivado software suite. The lab tasks involve designing and constructing a Multiplexer (MUX) and Demultiplexer (DEMUX) circuit, alongside a Lab Access Control circuit. The practical application of these tasks aims to enhance your understanding of digital electronics and FPGA programming.

Upon completion of this laboratory exercise, students will be able to:
1. Demonstrate a comprehensive understanding of the fundamental concepts and operations of combinational logic circuits.
2. Apply knowledge of Karnaugh Maps (K-maps) in the design and optimization of logic circuits.
3. Integrate the theoretical knowledge of digital logic design with practical skills in FPGA development.
4. Utilize VHDL programming to implement and validate combinational circuit logic.

PRE-LAB
You are expected to first read the lab instructions and background, and then complete the pre-lab assignment and submit it to eClass before the pre-lab deadline. Refer to the Tutorial Lab (Lab 0) as well to refresh lessons learnt.

Part I – Multiplexer (MUX) and Demultiplexer (DEMUX):
a) Determine the algebraic expressions for (you don't have to use Karnaugh maps):
i) The output of the MUX (shown as M in Figure 1) in terms of the three inputs and the mux-select signals.
ii) The outputs of the DEMUX in terms of the MUX output (M) and the demux-select signals. Include a truth table for each output (follow the same order as in the MUX truth table).
iii) The ENGINEER INDICATORS outputs of the DEMUX in terms of the demux-select signals.
b) Implement your expressions using VHDL logical operators, such as AND, OR, etc. (handwritten or a VHDL file).

Part II – Lab Access Control:
a) Refer to Part II of the BACKGROUND section (later in this document) to help understand the design requirements, and complete the TRUTH TABLE below comprising five input signals (C0, C1, K0, K1, K2) and three output signals (Alarm, Lab0_Unlock, Lab1_Unlock) relating to the problem discussed. To derive the logical expressions, you may either use Karnaugh maps (provided in the appendix) or use the direct method. A valid expression and circuit will be accepted even if it is not in minimal form.
Note: Some of the entries in the table are completed to establish clarity on the requirements.
b) Provide the algebraic expressions and draw the logic circuit for the following outputs:
1. Lab0_Unlock
2. Lab1_Unlock
3. Alarm
Hints:
➔ You may use the "Lab0_Unlock" and "Lab1_Unlock" signals in your expression for the "Alarm" signal.
➔ A valid expression and circuit will be accepted even if it is not in minimal form.

BACKGROUND
Part I – Multiplexer (MUX) and Demultiplexer (DEMUX):
Multiplexing means combining several signals for transmission on some shared medium (e.g., a telephone wire). The signals are combined at the transmitter by a multiplexer ("mux") and split up at the receiver by a demultiplexer. The communications channel may be shared between the independent signals in one of several different ways: time-division multiplexing (TDM), frequency-division multiplexing (FDM), or code-division multiplexing (CDM).

Generally, MUXes and DEMUXes are available as standard integrated circuit packages (ICs, e.g., the 74150 16-to-1 MUX), allowing for 2, 4, 8, or 16 inputs/outputs. Sometimes, however, a non-standard number of inputs is built from discrete AND and OR gates as required.

Let's discuss a practical scenario to help understand MUXes and DEMUXes.
This is the case for a small radio telescope array. You have been hired by the NASA (UofA employees union) as a co-op student. One of your tasks is to design, build, and test a 3-input/1-output MUX and an accompanying 1-input/3-output DEMUX using AND, OR, and inverter gates. The layout of the system is shown in Figure 1. We see that there are three parabolic antennas, each with a remote data logger. Each data logger sends important digital data back to the control center, via a shared transmission medium, where the radio astronomers analyse it.

The MUX's output will be fed into the DEMUX, and each of the three data outputs from the DEMUX (indicated by O) will be directed to the offices of three different radio engineers. O(0) is the output for the first engineer, whereas O(1) and O(2) are the outputs for the second and third engineers. The MUX select inputs (S) are used to choose which antenna's data is transmitted, and the DEMUX select inputs (DS) determine the receiving engineer.

An RGB LED will be used as an ENGINEER INDICATOR (EI), where it being ON provides visual confirmation that the engineer is receiving data (even if the received data is "0"). The color of the RGB LED indicates which engineer is receiving the data, with red, green, and blue corresponding to the first, second, and third engineers. Each color of the RGB LED is separately controlled, with EI(0), EI(1), and EI(2) controlling the red, green, and blue colors. Only one of these should be in the ON state "1" for our particular application here.

The MUX passes one of the three data signals to the transmission line. The desired data stream is selected using the mux-select signals S(0) and S(1) based on the following table. On the receiving end, the target engineer is selected using the demux-select signals DS(0) and DS(1). This is an example of time-division multiplexing, as the signals are separated on the transmission medium by time (only one signal at a time).

Part II – Lab Access Control Circuit:
The ECE department has decided to increase the access security for two of its new labs (Lab0, Lab1). There will be a card reader and a keypad to manage access to the labs. To gain entry into a particular lab, the user must have a OneCard with the correct code and enter an authorized 3-bit code using the keypad.

In this part of the lab, you will design and implement a circuit to control access to the two new labs (Lab0, Lab1). The door access to each of these labs is controlled by the circuit shown below. The control circuit accepts two input signals: Card-code (2 bits) and Keypad (3 bits). The output signals are Alarm, Lab0_Unlock, and Lab1_Unlock.

The description of the input/output signals is as follows:
1. Card-code: This is a 2-bit input signal (C0, C1) which represents the output of the card reader.
2. Keypad: This is a 3-bit input signal (K0, K1, K2) which represents the access code the user entered on the keypad.
3. Lab0_Unlock: This is a one-bit output signal that is applied to the door lock of Lab0. The door of Lab0 will be unlocked if Lab0_Unlock = '1'.
4. Lab1_Unlock: This is a one-bit output signal that is applied to the door lock of Lab1. The door of Lab1 will be unlocked if Lab1_Unlock = '1'.
5. Alarm: This is a one-bit output signal that triggers the alarm system.

Function Description:
A. To gain access to a particular lab (Lab0, Lab1), the user must use a valid OneCard and enter an authorized keypad code for that lab.
When the combination of card code and keypad code is correct, the appropriate unlock signal should be turned ON ("1"); otherwise, the unlock signal will be OFF ("0").

B. The Alarm signal should go high if:
1. There is a card in the reader (i.e., C0 C1 is not equal to "00"), and
2. The combination of the card code (C0 C1) and the keypad pressed keys (K0 K1 K2) is not correct.

1. (Part I: Simulation) Consider the system demonstrated in Figure 1. Implement a MUX with 3 inputs, I, and S as the signal select. Connect the output of the MUX to your DEMUX, with DS as its data select and O as its output. Simulate your design and show the results using the provided test-bench file, "lab2_part1_tb". In writing the test-bench file, we assumed your entity name is "lab2_part1". The test-bench imports your design as a component; therefore, it needs your entity name. If you have used another name for the entity, open the test-bench file in Vivado and change the name in the component and port map sections.

2. (Part I: Hardware Implementation) Add the constraint file for Part I from the lab 2 files (lab2_part1_constraint.xdc) on eClass to your project. This provided constraint file will map the MUX select inputs (S(0) and S(1)) to the first two switches on the Zybo board and the DEMUX select inputs (DS(0) and DS(1)) to the last two switches. Figure 2 below shows this mapping. The DEMUX outputs (the data received by the engineers, O(0), O(1), and O(2)) will be mapped to the three LEDs indicated.

The data (I(0), I(1), I(2)) must be externally provided to the Zybo board by the Analog Discovery 2 (AD2) tool through its Pmod ports (see Figure 2; Pmod ports are a type of input/output interface). To do this, you have to connect the provided adapter cable to the left Zybo Pmod port as shown in Figure 3(a). On the other side of this adapter cable, you need to use breadboard jumper wires to access pins 1, 2, and 3 on the cable. This is shown in Figure 3(b). These wires have to be connected to the AD2 DIOs 1 to 3, respectively. You also need to connect the Zybo board ground to the AD2 ground (black wire in Figure 3(b)). The connection can be conveniently made on a breadboard.

Figure 3

As you recall from lab 1, we have to use the WaveForms software to control the AD2. To set up the three-channel data, open the Patterns tab and set up DIOs 1 to 3 as shown in the following picture. By clicking on the Run button, the AD2 will generate digital signals (clock pulse signals) with the frequencies shown in the figure on its DIOs 1 to 3, which will then be fed to the Zybo board as I(0), I(1), and I(2).

After programming your board, you may now choose which of these data channels is sent to which of the radio engineers. You can clearly distinguish the messages by the blinking frequencies of the LEDs. Toggle the first two switches to multiplex between the messages. Use the other two switches to choose which radio engineer gets to see the message.

(Part II: Simulation) Implement your design using the VHDL behavioural approach and show simulation results using the provided "lab2_part2_tb" test-bench file. Change the component and port map names in the test-bench file if necessary.

3. (Part II: Hardware Implementation) Demo Part II on the Zybo Z7 board using the Lab II constraints file on eClass. The description of the Zybo Z7 inputs/outputs for this design is as follows: if you push the C(0) or C(1) buttons, their values are 1, otherwise 0. We will be using LED[0-1], SW[0-2], Button[0-1], and LED[6] (red), as illustrated in the above image.
Each of the corresponding signals for Part 2 will be displayed using the LEDs, and the input signals C[0-1] and K[0-2] are assigned to the buttons and switches respectively. To demonstrate your design, mimic the simulation waveforms from the previous section and show them to a TA/LI.

DOCUMENTATION
This lab exercise requires a formal report. Don't forget to include your name(s), section number, and lab number on the title page, and follow the guidelines described on the website. Handwritten reports will not be accepted. Answer the following questions in your report.

MUX and DEMUX
a. What happens when the mux-select and demux-select signals are in the unused state?
b. Justify your design for the DEMUX output (provide a simulation waveform, indicate two points in time on the waveform, and justify the output condition based on the input signals).
c. Discuss the differences between the design approach in this lab using the Xilinx FPGA board and the design approach in lab 1 using discrete ICs and a breadboard. List the pros and cons of each method and describe which one you would prefer and why.

Lab Access Control
a. Justify your design using the functional simulation (provide a simulation waveform, indicate two points in time on the waveform, and justify the output condition based on the input signals).
b. How would you rate the system's effectiveness?
c. Can you come up with a better user access system? If so, describe your system.

REFERENCES
● Zybo Z7 Documentation, Tutorials and Example Projects
● Zybo Z7 Reference Manual
● "Digital System Design with FPGA: Implementation Using Verilog and VHDL", Cem Ünsalan and Bora Tar, McGraw-Hill Education, 2017, ISBN 9781259837906
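Before writing the VHDL for the lab above, it can help to model the intended combinational behaviour in ordinary software. The following C++ sketch is illustrative only and is not one of the deliverables; in particular, the select encodings it assumes (e.g., "00" selecting I(0)) are placeholders, since the actual select table is given in the lab handout.

    // Illustrative C++ model of the 3-input MUX feeding the 1-to-3 DEMUX.
    // NOTE: the select encodings assumed here ("00" -> channel 0, "01" -> 1,
    // "10" -> 2) are placeholders; use the select table from the handout.
    #include <array>
    #include <iostream>

    // MUX: pass the data input chosen by the 2-bit select (s1 s0).
    bool mux3(const std::array<bool, 3>& I, bool s1, bool s0) {
        int sel = (s1 << 1) | s0;
        return (sel < 3) ? I[sel] : false;   // unused select state: output 0 in this model
    }

    // DEMUX: route the single line to the output chosen by (ds1 ds0); the others stay 0.
    std::array<bool, 3> demux3(bool line, bool ds1, bool ds0) {
        std::array<bool, 3> O{false, false, false};
        int sel = (ds1 << 1) | ds0;
        if (sel < 3) O[sel] = line;
        return O;
    }

    // Part II (access control) behaviour, in the same spirit:
    //   Alarm       = card present AND the card/keypad combination is not authorized
    //   LabX_Unlock = the card/keypad combination authorizes LabX
    // The authorized codes themselves are part of your own design.

    int main() {
        std::array<bool, 3> I{true, false, true};            // example antenna data
        bool line = mux3(I, /*s1=*/true, /*s0=*/false);      // assumed encoding "10": antenna 3
        auto O = demux3(line, /*ds1=*/false, /*ds0=*/false); // assumed encoding "00": engineer 1
        std::cout << O[0] << O[1] << O[2] << '\n';           // prints 100 under these assumptions
    }

In VHDL the same behaviour is typically expressed with selected (with/select) or conditional (when/else) signal assignments; the sketch's only purpose is to make the expected input-to-output routing concrete before simulation.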
In this lab, we will implement a 7-segment LED decoder displaying the hex value of binary inputs (SW0-SW3). Students will then learn to implement a binary up counter (0 to F) that increments every second and displays its value on the 7-segment LED. The 7-segment LED decoder will be implemented in VHDL, verified with a simulation test bench, and implemented on the Zybo Z7 board.

Upon completion of this laboratory exercise, students will be able to:
● Implement a 7-segment decoder and display the switch (SW0-SW3) data.
● Implement a two-digit decimal up counter.

PRELAB
You are expected to read the lab instructions and complete the pre-lab assignments before attending the lab sessions. You are responsible for submitting your pre-lab assignment on time in the drop-box on eClass.
1. Given the truth table below, use K-maps to determine the minimum logical expression for each of the 7 outputs (A through G). Note: The provided truth table has been designed in accordance with the inverted schematic presented in the manual. This is due to the particular orientation of the Zybo boards when mounted to the lab bench, which results in the digits being displayed upside down. Therefore, be mindful when interpreting the truth table and correspondingly programming your display to ensure accurate output, reflecting this unique setup of the Zybo boards in our lab environment.
2. Draw the circuit schematic diagram using the minimized logical expressions, using either all AND-OR-NOT logic or all NAND logic gates.
3. Write and include VHDL solutions in your pre-lab assignment (using VHDL logical operators, such as AND, OR, etc.).
4. Please watch the following video to get familiar with implementing a complete truth table using VHDL conditional statements: VHDL essentials 10 majority architecture.
NOTE: It is very important that you complete the pre-lab components before you come to the lab. Failing to do so will impede your ability to complete the lab within the allotted time.

BACKGROUND
A 7-segment display, as shown below, is manufactured using seven LEDs (segments A, B, C, D, E, F, G), as shown in Figure 2. An LED (Light-Emitting Diode) emits light when the anode is connected to the positive side of a DC voltage and the cathode is connected to the negative side, as shown below. Such a connection enables current to flow through the diode, enabling it to emit light.

There are two standard types of 7-segment displays: common-anode and common-cathode. The 7-segment board connected to the Zybo Z7 has the common-cathode configuration, as shown in the schematic below. A common-anode or common-cathode configuration reduces the number of pins necessary to operate the device, and also provides a global ON or OFF control. In this digital system, logic '1' is equivalent to +3.3 volts and logic '0' is equivalent to 0 volts.

Note: Due to the unique constraints of our lab setup, specifically the manner in which the Zybo boards are mounted to the bench, it isn't feasible to orient both the seven-segment display and the keypad in the standard 'right side up' position simultaneously. After careful consideration, we have opted to adhere to an upside-down representation to match the physical orientation of the seven-segment display on the Zybo board.

As you can see both on the schematic above and on the 7-segment board connected to the Zybo Z7, we have two 7-segment displays. In order to turn ON a segment on the first 7-segment display:
1. We must first apply logic '0' to the cathode signal (C=>P4 pin), which makes C1 logic '1'.
2. Then apply logic '1' to the respective segment (AA to AG) you want to turn ON.
To select and display on the second 7-segment display, follow these steps:
1. Apply logic '1' to the cathode signal (C=>P4 pin), which makes C2 logic '1'.
2. Then apply logic '1' to the respective segment (AA to AG) you want to turn ON.

PART 1: 7-SEGMENT DECODER IMPLEMENTATION
1. Download from eClass: design source, constraint file, and simulation test bench.
2. Create a new VHDL project to implement your minimized logical equations for all segments (the pre-lab work).
3. Demo your work to a TA or LI for a checkoff and present simulation results in your lab report.

PART 2: 7-SEGMENT DECODER IMPLEMENTATION USING VHDL
1. Implement the 7-segment decoder design using a VHDL if-else or case statement. You can find an introduction to these statements in VHDL in this video: VHDL essentials 10 majority architecture.
2. Demo your work to a TA or LI for a checkoff and present simulation results in your lab report.

PART 3: COUNTER FROM 0 TO LAST 2 DIGITS OF STUDENT ID
1. Download from eClass: design source, constraint file, and simulation test bench.
2. Create a new project and complete the design source. Comments in the design source explain what needs to be done. You are required to simulate and implement an up counter starting from 0 to the last 2 digits of your student ID in decimal. Note: If the last 2 digits of your student ID are less than 10, please choose any number greater than or equal to 10. Hint: The CC signal can toggle between '0' and '1'. While it is zero, you are displaying 4 bits on one 7-segment display. While it is one, you are displaying another 4 bits on the other 7-segment display. (An illustrative software model of this behaviour is sketched at the end of this handout.)
3. Demo your work to a TA or LI for a checkoff and present simulation results in your lab report.

REPORT REQUIREMENTS
1. Simulation results of Parts 1 and 2 for digits 0 to F.
2. Code for Parts 1, 2 and 3.
3. Simulation results for Part 3 (showing the counter counting up to the last two digits of your student ID and starting back from zero; for example, Student ID = *****62).
Note: In order to show signals in your waveform window, in the Scope select your design source, right-click on the signal and select "Add to waveform window".
Note: In order to run the simulation for as long as your student ID requires, click on 1 to reset it, then set a time in ns (suggested time: last 2 digits of your student ID + a few ns, e.g., 65 for someone with the last two digits equal to 62). Then click on 2 to run the simulation.

REFERENCES
● Zybo Z7 Documentation, Tutorials and Example Projects
● Zybo Z7 Reference Manual
● Pmod Seven-Segment Display Reference Manual
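As referenced in Part 3, the counting and digit-multiplexing behaviour can be prototyped in plain software before writing the VHDL. The C++ sketch below is illustrative only: seg_decode() is a placeholder (the real segment patterns come from the lab's inverted truth table), and the limit of 62 is just an example student-ID ending.

    #include <iostream>

    // Placeholder for the 7-segment decoder from Parts 1 and 2; it does not
    // contain the real (inverted) segment patterns used in this lab.
    int seg_decode(int digit) { return digit; }

    int main() {
        const int limit = 62;                   // example: last two digits of a student ID
        for (int tick = 0; tick <= limit + 2; ++tick) {
            int value = tick % (limit + 1);     // wrap back to 0 after reaching the limit
            int tens = value / 10;              // 4 bits shown while CC = 1
            int ones = value % 10;              // 4 bits shown while CC = 0
            for (int cc = 0; cc <= 1; ++cc) {
                int shown = (cc == 0) ? ones : tens;
                std::cout << "value=" << value << " CC=" << cc
                          << " digit=" << seg_decode(shown) << '\n';
            }
        }
    }

The sketch simply shows the intended split: one 4-bit digit is presented while CC is '0' and the other while CC is '1', with the count wrapping back to zero after the chosen limit.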
In this laboratory, students will engage in the development of combinational logic circuits, initially utilizing basic AND, OR, and NOT logic gates.Following this, they will code their solutions using VHDL, utilizing the Xilinx Zybo Z7 FPGA development board along with the Vivado software. This lab serves as a crucial stepping stone, providing a transition from circuit prototyping with Integrated Circuits (ICs) to Field-Programmable Gate Arrays (FPGAs).Upon completion of this laboratory exercise, students will be able to: 1. Demonstrate a comprehensive understanding of the fundamental concepts and operations of combinational logic circuits, including AND, OR, and NOT logic gates. 2. Apply knowledge of Karnaugh Maps (K-maps) in the design and optimization of logic circuits.3. Integrate the theoretical knowledge of digital logic design with practical skills in FPGA development, K-map optimization, and circuit monitoring, preparing for more advanced studies and applications in the field of electronics and digital system design.4. Utilize VHDL programming to implement and validate logic circuit designs on the Xilinx Zybo Z7 FPGA development board using the Vivado software. 5. Acquire hands-on experience prototyping circuits by building them on a breadboard and also more versatile methodologies like programming Field-Programmable Gate Arrays (FPGAs) using VHDL.6. Develop proficiency in utilizing the Analog Discovery 2 for real-time monitoring of circuit inputs and outputs, aiding in performance analysis and troubleshooting.PRE-LAB You are expected to first read the lab instructions and background and then complete the pre-lab assignment before attending the lab sessions. The pre-lab assignment must be submitted on eClass and no late submission is accepted. You are responsible for including your pre-lab assignment as an appendix to your lab report if necessary.Refer to Tutorial Lab as well to refresh lessons learnt. Logic equation(s) minimization using Karnaugh Map (K-Map): a. Simplify the logic relationship between inputs A,B,C & D to outputs F1 and F2 using K-Map, helping us find the simplest logic equation for any given truth table. (Prelab write up required).b. Draw the schematic diagram of minimized equations using only AND, OR and NOT gates. (Pre-lab write up required). c. It’s highly recommended to finish breadboarding of minimized functions (F1, F2) before coming to the lab, this will help you accomplish assigned tasks within lab time. If you aren’t familiar with breadboarding, you must watch the tutorial videos posted on eClass, helping you understand the basics on how to use the breadboard to build a circuit. (no write up required).Integrated Circuits (IC’s) pinouts and packaging forms: The primary categorization of IC packaging can be divided into two groups: through-hole and surface mount. ICs are available in a variety of packaging forms, each offering its unique set of features and uses. Key types include DIP (Dual In-line Package), SOIC (Small Outline Integrated Circuit), QFP (Quad Flat Package), and BGA (Ball Grid Array).The following table offers a comparative view of these popular IC packaging forms. Package Names, relevant questions and answers Pictorial Looks Dual-In-Line Package (DIP) DIP is a through-hole type and has two rows of pins intended for insertion into a breadboard or PCB. Where is DIP package pin #1?Answer: Seeing the package from the top, one can see a notch in the top middle or a circle on the top left hand side or both, as shown in the pictures. 
The top left pin is #1 and the pin numbering goes as shown and in the respective IC datasheet.Small outline integrated circuit (SOIC) A smaller, surface mount type, is designed for situations where space efficiency is crucial. Where is SOIC package pin #1? Same as described for DIP packages.Quad Flat Pack (QFP) Another surface mount type, has four sides of pins and is useful for higher pin count applications. Where is QFP package pin 1? Top left has the etched circle or cut corners as shown. click here to see an example from a component datasheet Ball Grid Array (BGA) Known for its underside grid of solder balls, allows for even higher pin densities and improved heat dissipation.Where is BGA package pin #1 An etched circle, marked line, cut corners or all combined on the top left hand side of the package. Refer to package datasheet for pinout details. click here to see an example from a component datasheetLAB HARDWARE Students will use Digilent Analog Discovery 2 hardware to complete this lab. Let’s get you introduced to this hardware tool. Key features and associated details: ➔ 30MHz bandwidth 2 channel (1+, 1-and 2+, 2-) oscilloscope. ➔ 12MHz bandwidth 2 channel (W1, W2) waveform generator. ➔ One positive supply channel (V+) of up to +5 volts. ➔ One negative supply channel of (V-) up to -5 volts. ➔ Two channels for external triggering (T1, T2). ➔ 16-channel (0-15) digital logic analyzer. ➔ ↓ => Ground (GND) lines.DIGILENT ANALOG DISCOVERY 2 WaveForms is the software for Analog Discovery 2 and here are details on how to use it. 1. Doubleclick on Waveforms icon on desktop OR click Start➞All Programs➞Digilent➞Waveform Applications➞Waveforms 2. Click on the Settings menu and select Device Manager.3. Select “Discovery2” and click “Select” on bottom right. This will bring you back to the main screen. 4. Click on the Wavegen button to open the waveform generation tab. Click the “Run All” or “Run” button. a. Click on the Supplies button to open the DC power supply generation tab.b. Select 5V for V+ and click on to turn it on. c. Click on Scope button to open oscilloscope channels monitoring tab and click Run5. This step is only for students to learn the capabilities offered by Digilent Analog Discovery. Students may or may not use the waveform generator capability of Analog Discovery in this lab. a. Connect waveform generation W1 channel with oscilloscope channel 1+ and power supply V+ channel with 2+, using male header pins or stripped wires. b. You will get the following view depending on your selection on the right hand side menu options.6. Learn how to use the voltmeter. Turn the voltmeter selector knob from off to DC voltage, make sure the red voltmeter cable is inserted in and black cable is inserted into . Now touch the red cable with V+ output channel and black to ⏚ (GND) to see results.Students should now be comfortable in using additional lab resources to what they learnt in Lab 0 to complete this lab, any questions kindly seek guidance from lab teaching staff. Note 1: Human touch to an IC package pins without properly grounding their body can potentially damage the IC due to the body’s static electricity discharge to IC. Lab teaching staff will demonstrate how to hold and use IC packages.Note 2: In this lab we will only be working with DIP IC packages.LAB COMPLETION REQUIREMENTS 1. Implement your pre-lab design using breadboard, stripped wires and AND, OR, NOT logic gates, for F1 and F2. 
Relevant component datasheets will provide all information required to complete this section of the lab. Use the voltmeter to verify a few cases.
Note 3: Use the V+ DC supply output to power the ICs. Refer to the IC datasheet to find the pinout and connect VCC and GND appropriately.
Note 4: The use of pull-up and pull-down resistors is common practice in digital electronics: they ensure that the input pins are always in a well-defined state, either 'HIGH' (logic 1) or 'LOW' (logic 0), preventing 'floating' states from generating noisy signals that might interfere with the circuit's expected behavior. In the next drawing and picture you can see the schematic diagram and breadboard connections for both of these.
Note 5: With pull-up/down resistors in place, you can connect the DIO ports from the Analog Discovery 2 as you would connect normal inputs; this approach guarantees that the inputs will always have a defined value in case an input is left 'floating'. Instead of connecting your inputs to power and ground through the pull-up or pull-down resistors, connect them to the DIO ports of the AD2 (Analog Discovery 2). Using the Patterns window and the Logic window, you can test your circuit.
Note 6: The circuits depicted in the previous illustrations serve only as a demonstrative example and should not be construed as a solution to the exercise problem. They are intended to illustrate key concepts and assist you in understanding the overall layout and connections. Students are expected to refer to the relevant IC datasheets to find pinouts and make the required connections to implement the solution for F1 and F2.

Open the Patterns window and set up input patterns as shown in the next picture. In the Logic window, add the input and output signals (DIO0 to DIO5) as shown in the next picture. Change the scan mode to "Shift" and then click on Scan. You should be able to see the patterns live, for both the inputs and outputs. Click on Stop and capture all cases. Include such snapshots in your report.

2. Implement the pre-lab work using VHDL and show the simulation test case results. Open a new project in Vivado, following the same steps mentioned in lab 0. Download lab1.vhd, lab1_tb.vhd, and const.xdc from eClass and add them to your project. Complete the source file, lab1.vhd. Assume: sw(0) = A, sw(1) = B, sw(2) = C, sw(3) = D, led(0) = F1, led(1) = F2.
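A quick way to check a K-map minimization before breadboarding or writing VHDL is to compare the minimized expression against the original truth table over all 16 input combinations. The C++ sketch below is entirely illustrative: the expression in f1_minimized() and the empty truth-table column are placeholders, not the lab's actual F1.

    #include <array>
    #include <iostream>

    // Placeholder minimized expression; substitute your own K-map result for F1.
    bool f1_minimized(bool a, bool b, bool c, bool d) {
        return (a && !c) || (b && d);           // hypothetical example only
    }

    int main() {
        // Placeholder truth-table column for F1 (one entry per ABCD combination).
        std::array<bool, 16> f1_truth{};        // fill in from the lab's truth table
        for (int row = 0; row < 16; ++row) {
            bool a = (row & 8) != 0, b = (row & 4) != 0;
            bool c = (row & 2) != 0, d = (row & 1) != 0;
            if (f1_minimized(a, b, c, d) != f1_truth[row])
                std::cout << "Mismatch at row " << row << '\n';
        }
        std::cout << "Check complete\n";
    }

If the minimized expression disagrees with the truth table on any row, the K-map grouping should be revisited before building the circuit.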
1. You have two puzzles with parameters as follows:
Puzzle A: One sub-puzzle. k = 5.
Puzzle B: Four sub-puzzles. k = 3.
You should provide the following for both puzzles (part (b) only needs to be answered once):
(a) The distribution of the number of cases that require each number of hashes. 1 Mark
(b) Explain the method you used to obtain your distributions. Don't go into too many details or show working; it's more along the lines of "I wrote a C++ program to … and then using … I …". 0.5 Mark
(c) A graph of the distribution of the data above. 0.5 Mark
(d) The average number of hashes needed. 0.5 Mark
(e) The population standard deviation for the distribution of the number of hashes needed. 0.5 Mark
You should assume that if there are N possible solutions, you still check the Nth by hashing even if all the others have failed and it must therefore be the solution.

2. Using a TCP SYN spoofing attack, the attacker aims to flood the table of TCP connection requests on a system so that it is unable to respond to legitimate connection requests. Consider a server system with a table for 512 connection requests. This system will retry sending the SYN-ACK packet five times when it fails to receive an ACK packet in response, at 30 second intervals, before purging the request from its table. Assume that no additional countermeasures are used against this attack and that the attacker has filled this table with an initial flood of connection requests. At what rate (per minute) must the attacker continue to send TCP connection requests to this system in order to ensure that the table remains full? Assuming the TCP SYN packet is 32 bytes in size, how much bandwidth does the attacker consume to continue this attack? 1 Mark

3. Consider that the incidence of viral attachments in email messages is 1 in 250. Your malware checker will correctly identify a message as viral 95% of the time. Your malware checker will correctly identify a message as non-viral 95% of the time. Your malware checker has just flagged a message as being malware. What is the probability that the message is actually okay (no malware)? Justify your answer using Bayes theorem. 1 Mark

4. Many websites use a CAPTCHA image on their login page. A typical application of this is in an HTML form asking for the email ID and the login password of a user. The webpage also shows some numbers and letters, modified in a manner such that it is still easy for a human to recognize these characters. The user is then asked to recognize these characters and is granted login access only when they successfully enter the characters. Explain how using a CAPTCHA can help prevent email spam. What is the main difficulty with using CAPTCHAs? 1 Mark

5. What are honeypots? How are they better at resisting spam bots than CAPTCHAs? 1 Mark

6. Briefly describe, in your own words, each of the following. Be sure to specify the domain and nature of each.
(a) WannaCry. 0.5 Mark
(b) XML Bomb. 0.5 Mark

7. Consider the database below and answer the questions based on it.
Name     Gender  School       Position       Salary
Alex     Male    Computing    Lecturer       $80,000
Bob      Male    Mathematics  Lecturer       $60,000
Carol    Female  Mathematics  Lecturer       $100,000
Diana    Female  Computing    Lecturer       $65,000
Ewen     Male    Physics      Lecturer       $72,000
Fran     Female  Physics      Lecturer       $98,000
Gary     Male    Computing    Administrator  $40,000
Humphry  Male    Mathematics  Lecturer       $72,000
Ivana    Female  Computing    Tutor          $12,000
Jeff     Male    Physics      Administrator  $80,000
Kim      Female  Mathematics  Lecturer       $100,000
Lex      Male    Computing    Tutor          $12,000
Morris   Male    Engineering  Tutor          $15,000

Assume you only have a statistical interface, so only aggregate queries will be successful. You know Fran is a female Physics Lecturer and Diana is a female Computing Lecturer. The questions below explore how we might determine their salaries using inference, in the presence of various query size restrictions.
(a) Assume there is no limit on the query size. Give a sequence of two queries that will identify the salary of Fran. 1 Mark
(b) Suppose that there is a lower and upper query size limit that satisfies k ≤ |X(C)| ≤ N − k with k = 2. Show a sequence of queries that could be used to determine Diana's salary. 1 Mark

Notes on submission
1. Submission is via Moodle. Your submission file should be a PDF file.
2. Late submissions will be marked with a 25% deduction for each day, including days over the weekend.
3. Submissions more than three days late will not be marked, unless an extension has been granted.
4. If you need an extension apply through SOLS, if possible before the assignment deadline.
5. Plagiarism is treated seriously. Students involved will likely receive zero.
1. Determine the entropy associated with the following method of generating a password. 1 Mark
Choose, and place in this order, one lower case letter, followed by one upper case letter, followed by two digits, followed by @, followed by two letters (each upper or lower case), and then followed by four symbols drawn from the set {$, 7, 3, v, w, J, z, T}. Finally, apply the hash function Tiger to give an output string in hex, which will be used as the password. Assume random choices are made with each symbol in the relevant space being equally likely; so for a random digit there are 10 possibilities, each chosen with probability 1/10.

2. For the following collection of statements, describe the sets of actions, objects, and subjects; and draw an access control matrix to represent the scenario. 1 Mark
Alice can climb trees and eat apples. Bob can climb fences, eat apples, and wave flags. Trees can hurt apples. Carol can jump waves, eat apples, and wave flags.

3. Assume an application requires access control policies based on the applicant's age and the type of funding to be provided. Using an ABAC (attribute-based access control) approach, write policy rules for each of the following scenarios:
(a) If the applicant's age is more than 35, only "Research Grants (RG)" can be provided. 1 Mark
(b) If the applicant's age is less than or equal to 35, both "RG and Travel Grants (TG)" can be provided. 1 Mark

Part Two: Authentication and access control system 6 Marks
You are to implement a simple "file system" with login authentication and access control. Specifically:
• Construct a hash/salt/shadow based user/password creation system.
• Construct a hash/salt/shadow based user authentication system.
• Construct an associated file system, into which a user can log. Files can be created, read from, and written to, but only in accordance with a four-level access control model.
• The levels of the four-level access control model are 0, 1, 2 and 3. 0 is dominated by 1, 2 and 3; 1 and 2 are dominated by 3; and 1 is dominated by 2.
Remark: You do not need to have an actual file system, simply an internal collection of records at the levels specified.
You can implement the program in C++, C, Python or Java. You will use MD5 in this task.

The initialisation details 2 Marks
Your program will, initially, need blank files salt.txt and shadow.txt. Running your FileSystem with the instruction FileSystem -i runs the hash/salt/shadow based user/password creation system. This program should prompt for a username, something like:
Username: Bob
Check if the username exists already. If it does, terminate the program with an appropriate notification to the user. If it doesn't, request a password with something like:
Password: ……..
Confirm Password: ……..
You should add some appropriate checks on the password and give a warning if the user fails to meet a requirement. The warning should explain what the requirements are. Assuming the passwords are acceptable and the same, we make a final request of the user, something like:
User clearance (0 or 1 or 2 or 3): 1
Once we have this information we can modify the salt.txt and shadow.txt files to include this user. To salt.txt we add a line, with a generic example and a specific one for user Bob given here:
Username:Salt
Bob:38475722
where Salt is a randomly chosen string of 8 digits.
We also add a line to shadow.txt, with a generic example and a specific one for user Bob given here:
Username:PassSaltHash:SecurityClearance
Bob:dd2da44f4437d529a80809932cb3da83:1
PassSaltHash is generated as the MD5 hash of the concatenation of the user's password with the salt. For example, if the Password is "alphabet" and the Salt is "12345678", we would pass "alphabet12345678" to the MD5 function. (An illustrative sketch of this computation is given at the end of this assignment.)

Logging in 2 Marks
Running FileSystem with no arguments will allow a user to try and log into the file system. Remark: The file Files.store needs to be loaded; see later as to what this contains.
Username: Bob
Password: ……..
The system checks if the Username is listed in the file salt.txt. If the Username is in the file then their salt value is retrieved and the PassSaltHash is generated. A message should be displayed to indicate that the salt has been retrieved.
Bob found in salt.txt
salt retrieved: Salt
hashing …
hash value: PassSaltHash
The system should now compare the PassSaltHash value with that in the file shadow.txt. If the information in shadow.txt doesn't match the generated information, FileSystem should stop with appropriate error messages. If the shadow.txt information matches, the clearance of the user is reported, and authentication is reported to be complete.
Authentication for user Bob complete.
The clearance for Bob is 1.

Once logged in … 2 Marks
A list of allowed actions is now displayed:
Options: (C)reate, (A)ppend, (R)ead, (W)rite, (L)ist, (S)ave or (E)xit. (1)
The C option will result in a request for a filename from the client.
Filename: alpha
The classification level of a "file" is the same as its owner's clearance level. The program should maintain a list of "files" as internal entries. If the passed file doesn't exist, its name, owner and classification should be added to the list. If the passed file does exist, an appropriate message should be displayed and the system should re-display the menu marked (1).
The A, R and W choices each result in a request for a filename.
Filename: alpha
Again a check is made as to whether the file exists. If the file doesn't exist, an appropriate error message should be provided and the menu (1) should be re-displayed. If the file does exist, a message informing success or failure will be displayed. Success or failure is determined by the relative clearance of the user and the classification of the file they are trying to access, in accordance with the Bell–LaPadula model. Subsequently the menu (1) should be re-displayed.
The L option lists all files in the FileSystem records. The S option saves all the data to a file Files.store. This file should be human readable. This file should always be loaded if it is available when FileSystem starts without the -i argument. The E option should exit the FileSystem, after checking with the user:
Shut down the FileSystem? (Y)es or (N)o
NOTE: When your program starts, before bringing up a prompt, it should report a test output of the MD5. You should call your MD5 with the string "This is a test".
MD5 ("This is a test") = ce114e4501d2f4e2dcea3e17b546f339
Don't hard-code this output.

NOTE ON SUBMISSION
1. Submission is via Moodle.
2. Include the compilation instructions with your submission in a file readme.txt. That file should also contain a description of how you do the reduction. If no such file exists in your submission then you will get zero for this Part Two.
3. Make sure your program runs perfectly on CAPA before submission. If it does not run on CAPA when testing then you will get zero for this.
4. Late submissions will be marked with a 25% deduction for each day, including days over the weekend.
5. Submissions more than three days late will not be marked, unless an extension has been granted.
6. If you need an extension apply through SOLS, if possible before the assignment deadline.
7. Plagiarism is treated seriously. Students involved will receive zero.
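For reference, the PassSaltHash construction described above (MD5 of the password concatenated with the salt) and the required start-up self-test can be sketched as below. The fragment is illustrative only; it assumes the OpenSSL 1.1-style MD5() routine is available (link with -lcrypto), but any MD5 implementation that reproduces the stated test vector is acceptable.

    #include <openssl/md5.h>
    #include <cstdio>
    #include <string>

    // Hex-encoded MD5 digest of an arbitrary string.
    std::string md5_hex(const std::string& input) {
        unsigned char digest[MD5_DIGEST_LENGTH];
        MD5(reinterpret_cast<const unsigned char*>(input.data()), input.size(), digest);
        char hex[2 * MD5_DIGEST_LENGTH + 1];
        for (int i = 0; i < MD5_DIGEST_LENGTH; ++i)
            std::sprintf(hex + 2 * i, "%02x", digest[i]);
        return std::string(hex, 2 * MD5_DIGEST_LENGTH);
    }

    int main() {
        // Start-up self-test required by the spec (the value is computed, not hard-coded).
        std::printf("MD5 (\"This is a test\") = %s\n", md5_hex("This is a test").c_str());

        // PassSaltHash example from the spec: password "alphabet", salt "12345678".
        std::string pass_salt_hash = md5_hex(std::string("alphabet") + "12345678");
        std::printf("PassSaltHash = %s\n", pass_salt_hash.c_str());
    }

The same idea applies at login: retrieve the user's salt from salt.txt, recompute md5_hex(password + salt), and compare it against the stored PassSaltHash in shadow.txt.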
This assignment is to write a C++ program for hexadecimal number processing using generic programming. Given different types of documents (e.g., bonds, certificates, and contracts), the task is to compute some statistics based on similarity between documents and character frequencies. For simplicity, we assume that there is only one type of document, namely certificates. You need to define two classes, namely, Certificate and DocumentHandler. The first class is for recording the information of each certificate and for similarity computation, whereas the second class is for computing some statistics based on all certificates.

Roadmap. In Section 1 and Section 2, we describe the Certificate class and the DocumentHandler class respectively. In Section 3, we describe how to run your program and what your program should output. In Section 4 and Section 5, general notes and submission guidelines are described.

Section 1: the Certificate class
A certificate consists of many two-digit words. Each word should be a mix of hexadecimal digits from 0 to 9 and A to F (e.g., A8, FC, 24, and 4C). The following methods should also be provided.
• A constructor: this class should have a parametric constructor whose input argument is len. This constructor will produce random two-digit words, with the total number of words equal to len. These words should be a mix of hexadecimal digits from 0 to 9 and A to F, and they will be considered the content of this certificate.
• A container: it should store the words and the spaces between words (i.e., all content in this document).
• A destructor: it should output a notice of object destruction.
• A display function: this method should display the content of this certificate.
• A similarity function: this method should take another certificate instance to compare against. The result is the number of related word pairs between the two certificates. Given a word w1 from certificate 1 and a word w2 from certificate 2, w1 and w2 are related if their first digits are the same. (A small illustrative sketch of this pair counting is given at the end of this specification.)
Example: the similarity between the two certificates below is 4, since we have 4 pairs of related words (i.e., CF and C0, FB and FA, 40 and 4A, and 4F and 4A).
8C CF FB 0E AD A7 40 4F A6 (vs) 3F 9E FA DF 63 99 5B 4A C0

Section 2: the DocumentHandler class
This class should be used to store and process a collection of documents of the same type. Therefore, it should be a template class and the input documents of the same type should be treated as a generic type T. For simplicity, we assume there is only one document type (i.e., certificate). However, this class must treat the input certificates as a generic type T, and we can also assume that our provided member attributes and functions in these two classes are general and can work with different types of documents. The following methods and attributes should be provided.
• A container which stores all document objects of the generic type T.
• A display function: this method should display the content of all stored documents.
• A minSimilarity function: this method should determine the minimum similarity (i.e., the minimum number of related word pairs) across all pairs of documents in the container.
• A digitStatistics function: this method should determine the digits with the minimum and maximum frequency. It is possible that there are multiple digits with the same maximum or minimum frequency. Report all of them.
Example of illustration: we have three lines below and each line refers to one certificate with 6 words.
B6 EB FB 4E 35 A6
3C A7 4C AA 3B 4D
10 23 44 D1 D2 5E
For the 1st and 2nd certificates, we have: B6 EB FB 4E 35 A6 (vs) 3C A7 4C AA 3B 4D. Similarity: 6
For the 1st and 3rd certificates, we have: B6 EB FB 4E 35 A6 (vs) 10 23 44 D1 D2 5E. Similarity: 1
For the 2nd and 3rd certificates, we have: 3C A7 4C AA 3B 4D (vs) 10 23 44 D1 D2 5E. Similarity: 2
As such, we have minSimilarity = 1. For digitStatistics, the result will be:
Less frequent digits with count:
0 : 1
7 : 1
F : 1
Most frequent digits with count:
4 : 5

Section 3: The Program Execution and Output
Given your executable program 'main', the number num of certificates to be generated and the number len of words in each certificate, your program should run like below:
./main num len
Program output: you should display all certificates' content, followed by the minimum similarity, and the least frequent and most frequent digits with frequency statistics. Below is an example output:

Section 4: General notes
1. Your assignment should be sensibly organised with the same kind of expectations as previous assignments.
2. Other than the initial command line input, the program should run without user input. This means there should not be pauses during the running time.
3. A character can be represented as an integer. You can search for an ASCII table online for more details.
4. You must use classes, and the DocumentHandler class must be a template class and handle certificate objects as a generic type.
5. All of your self-defined functions should be sensibly wrapped into classes as member functions.
6. You are allowed to use any built-in containers.
7. Make sure that your code is C++17 compliant and can compile on Capa.

Section 5: Submission Guidelines
Please submit a zip file via Moodle; it should only contain all your source files and a Readme file.
1. Late submissions will be marked with a 25% deduction for each day, including weekends.
2. Submissions more than four days late will not receive marks, unless an extension has been granted.
3. If you need an extension, apply through SOLS before the assignment deadline.
4. Academic misconduct is treated seriously. Specifically, any plagiarised work will be awarded a zero mark and reported to the University.
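To make the relatedness rule concrete, here is a small standalone sketch of the pair counting described in Section 1: two words are related when their first hexadecimal digits match, and every related (w1, w2) pair across the two certificates is counted. It is illustrative only and is deliberately not organised into the Certificate and DocumentHandler classes the assignment requires.

    #include <iostream>
    #include <string>
    #include <vector>

    // Count related pairs: (w1 from certificate 1, w2 from certificate 2) is
    // related when the first hex digits of the two words are equal.
    int similarity(const std::vector<std::string>& a, const std::vector<std::string>& b) {
        int related = 0;
        for (const auto& w1 : a)
            for (const auto& w2 : b)
                if (!w1.empty() && !w2.empty() && w1[0] == w2[0])
                    ++related;
        return related;
    }

    int main() {
        std::vector<std::string> cert1{"B6", "EB", "FB", "4E", "35", "A6"};
        std::vector<std::string> cert2{"3C", "A7", "4C", "AA", "3B", "4D"};
        std::cout << similarity(cert1, cert2) << '\n';   // prints 6, matching the example above
    }

Note that a single word can contribute to several pairs (for example, 4E above is related to both 4C and 4D), which is why the first worked example yields 6 rather than 3.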
This assignment is to be implemented using object-oriented programming. It involves implementing a simulation of an election campaign in a country named Verdeloria. The citizens, called Verdeloria people, are voting in an election for three political parties, from which a winner will emerge.

In this campaign, you will see three political parties, and they try to win votes from electoral divisions. For each party, there is a leader and there are several candidates. Each candidate will represent the party to earn votes in an electoral division (i.e., each party fields exactly one candidate in each division).

Each candidate and each electoral division will express its own stances on five national issues. The voting score received by a candidate in one division is heavily impacted by voting factors (e.g., the characteristics of the leader, the candidate and the division). In the campaign, there are several event days. On each event day, each division will have a probabilistic event which impacts the voting factors.

On the election day (i.e., the last day of the campaign), these factors will be finalized and used to compute candidates' voting scores. For each division, the candidate with the highest voting score (and thus the corresponding party) wins this division. The party which wins the greatest number of electoral divisions is the final winner and it will form a government. If there is more than one party with the greatest number of wins, there is a hung parliament.

Organization structure of the specification. In what follows, the major components in the campaign (e.g., issues, stances, divisions, event days and the election day) are introduced in Section 1. The assignment requirements are introduced in Section 2. General notes are given in Section 3, followed by the submission guidelines in Section 4.

1 Components
1.1 Issues and Stances
Issues. You need to decide on five (5) national issues (e.g., global warming, road infrastructure and rights of pets) with some simple description of why each issue is important.
Stances. For each issue, we model a stance on the issue as a tuple of 2 elements:
The first element: the significance of the issue (quantified as a real number in (0, 1]).
The second element: the strength of the measure for addressing this issue (quantified as a real number in (0, 1]; you do not need to specify what the measure is).
Note: Each candidate and each division will have its own stance on each issue.

1.2 Political parties and candidates
Each political party consists of:
1. One leader (not a candidate) with attributes: popularity (a real number in (0, 1]).
2. One candidate (for each electoral division) with attributes: stances on the five issues and popularity (a real number in (0, 1]).
Stance initialization: for each party, candidates should have the same initial stance for a specific issue. The initialization should be randomized, and you define the randomness (e.g., the sampling distribution and the sampling range). Note that these stances may change across event days.

1.3 Electoral divisions
In each division, there is only one stance for each issue. Each stance is initialized randomly, and you define the randomness. Note that these stances may change across event days. In addition, each division has a population attribute which is a real number uniformly sampled from [0.5, 1.5].
The unit is million and can be omitted during any necessary computation.

1.4 Campaign Event Days
Each event day of the campaign involves the candidates and leaders attempting to convince the electoral divisions to support them. On each event day, in each electoral division, one of the local events below will happen with a probability of 0.8. This means that nothing will happen with probability 0.2, and the sum of the probabilities of the four local events below is 0.8. It is up to you to decide how the probability of 0.8 is distributed among these events (the probability of each event needs to be a non-zero value):
1. Two candidate-related events. One event will impact candidates' popularity and the other event will affect the stances of the candidates in this division. You define the rules of these events.
2. One leader-related event. It will affect the popularity of the corresponding leaders. You define the rules of this event.
3. One issues-related event. This event will affect the stances of the electoral division on the issues. You define the rules of this event.
Note: these events must not push any related attribute's value outside its predefined range (e.g., (0, 1] is the range for popularity and stances). You need to ensure some uncertainty in all events such that we will not always get the same result if we run the same local event multiple times.

1.5 Election day and wrapping up…
On the election day, you need to compute the candidates' voting scores, decide the winner of each division, and determine the overall winner. Each candidate will only receive one score, from the one division they contest. In each division, there are multiple candidates from different parties. The candidate with the highest voting score will win the electoral division election.

Below is how you compute the score for one specific candidate in one division. The formula considers the following factors, listed in order of their significance.
1. The Stance-and-population Factor. The value of this factor depends on the division population and the average cosine similarity between the candidate's stances and the division's stances. For each issue, the stance cosine similarity between the candidate and the electoral division is calculated from their two-dimensional stances. Given two vectors/stances X = (x1, x2) and Y = (y1, y2), the cosine similarity is computed as:
cos(X, Y) = (x1*y1 + x2*y2) / ( sqrt(x1^2 + x2^2) * sqrt(y1^2 + y2^2) )
The final Stance-and-population Factor can be computed as (* means 'times'):
Average cosine similarity (over the 5 stances) * population of this division (a real number from [0.5, 1.5])
2. Candidate popularity.
3. Leader popularity.
Therefore, the voting score of one candidate can be computed as: A*(the Stance-and-population Factor) + B*(candidate popularity) + C*(leader popularity). Requirements on the coefficients (A, B and C): their sum should be equal to 1 and A > B > C. You decide the values of A, B and C. (A small numerical sketch of this computation is given at the end of this specification.)

You need to write a program, and the requirements are specified in Section 2.1. You also need to write a report in PDF format, and the requirements are specified in Section 2.2.

2.1 Program requirements
Your code must compile on Capa (i.e., it must be C++17 compliant) according to the instructions you provide in a Readme file. Once your program is compiled into the executable APE, it must run as follows:
./APE n m
where n is the number of electoral divisions in the nation, and m is the number of campaign days before the election. The value of n should be in the range 1 to 10 inclusive. The value of m should be in the range 1 to 30 inclusive.
You can use some containers to pre-store the names of parties, divisions and people, and retrieve these values to create objects/instances. Once you run the program, it needs to output the following to the terminal:
1. Before the campaign event days, you should output: (1) party report: info on the parties (i.e., all entities with their attributes in each party) and (2) nation report: each electoral division with its stance distributions and population.
2. During the simulation of each event day, you should output: (1) event-related report: what has happened in each electoral division, and the significant impacts that have occurred, (2) party report: info on the parties (i.e., all entities with their attributes in each party), (3) nation report: each electoral division with its stance distributions and population.
3. On the election day, you should output: a national summary, showing the number of electoral divisions won by each party. The party which wins the greatest number of electoral divisions is the final winner and it will form a government. You need to report the party leader as the new leader for the country. If there is more than one party with the greatest number of wins, you should report a hung parliament.

2.2 Content of the PDF report
The report is worth around 40% of the total score and the diagram is worth around 50% of the report. The report should have six sections. The first five sections correspond to Sections 1.1 to 1.5. In these sections, you describe the details of these elements in the program:
• For Section 1.1, describe how you define the issues, with some simple description of why each issue is important, and how you model stances.
• For Section 1.2, describe the characteristics of each party's leader and candidates, and the stances' initialization.
• For Section 1.3, describe the divisions' names, the stances' initialization and the populations.
• For Section 1.4, describe the events' rules and probabilities.
• For Section 1.5, describe how you derive the coefficients.
In the last section (Section 2), you need to draw the UML class diagram for this program. In the assignment folder, we provide instructions for using UMLet to draw the class diagram. You are also allowed to use other tools (e.g., draw.io and Lucidchart) for drawing the class diagram.

3 General notes
These are some general rules about what you should and should not do.
1. Your assignment should be sensibly organised with the same kind of expectations as assignment one, although you may reasonably have more files for this assignment.
2. Other than the initial command line input, the program should run without user input. This means there should not be pauses waiting for the user to press a key.
3. Please do not submit project, object or .txt files. You must submit only the source code files (.cpp, .hpp) necessary to generate an executable file, APE, and the PDF report.
4. You must use classes, and should make use of encapsulation.
5. You must use inheritance. If you do not use inheritance, you will lose a few marks.
6. You can use polymorphism; if so, justify its appropriateness.

4 Submission Guideline
Please submit a zip file via Moodle. It should contain your source (i.e., .cpp and .hpp) files, a Readme file with instructions for compiling, and the PDF report. Make sure your report is in PDF format and has your name and student number marked clearly on it. The UML-like diagram should be an integral part of your final PDF-formatted report and not in a separate document. Your report may be ignored if it is in a non-PDF format.
1. Late submissions will be marked with a 25% deduction for each day, including weekends.
2. Submissions more than four days late will not receive marks, unless an extension has been granted.
3. If you need an extension, apply through SOLS before the assignment deadline.
4. Academic misconduct is treated seriously. Specifically, any plagiarised work will be awarded a zero mark and reported to the University.
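For reference, the stance cosine similarity and the voting-score combination from Section 1.5 can be sketched numerically as follows. The C++ fragment is illustrative only: the input values and the coefficients A = 0.5, B = 0.3, C = 0.2 are assumptions chosen merely to satisfy A > B > C and A + B + C = 1; you must choose and justify your own values.

    #include <cmath>
    #include <iostream>

    // Cosine similarity between two 2-D stances X = (x1, x2) and Y = (y1, y2).
    double cosine(double x1, double x2, double y1, double y2) {
        double dot = x1 * y1 + x2 * y2;
        double norm = std::sqrt(x1 * x1 + x2 * x2) * std::sqrt(y1 * y1 + y2 * y2);
        return dot / norm;               // stances lie in (0, 1], so norm is never zero
    }

    int main() {
        // One issue's similarity, using made-up stance values.
        std::cout << "example cosine = " << cosine(0.8, 0.4, 0.6, 0.9) << '\n';

        // Placeholder inputs for one candidate in one division.
        double avg_cos = 0.9;            // average cosine similarity over the 5 issues
        double population = 1.2;         // division population, sampled from [0.5, 1.5]
        double cand_popularity = 0.7;    // in (0, 1]
        double leader_popularity = 0.6;  // in (0, 1]

        // Example coefficients only: they satisfy A > B > C and A + B + C = 1.
        const double A = 0.5, B = 0.3, C = 0.2;
        double score = A * (avg_cos * population) + B * cand_popularity + C * leader_popularity;
        std::cout << "voting score = " << score << '\n';
    }

Keeping the factors in this order (stance-and-population first, then candidate popularity, then leader popularity) matches the stated order of significance, since A > B > C weights them accordingly.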