Assignment Chef


Assignment catalog

33,401 assignments available

[SOLVED] CSCI 4170 Homework 6

Part 1: Transformers

Task 1 (30 points): In this task you will work with the Facebook BART model (https://huggingface.co/docs/transformers/en/model_doc/bart) to provide text summarization of news articles. Text summarization in Natural Language Processing (NLP) is a technique that condenses long texts into sentences or paragraphs while retaining the text's meaning and important information. Pick any one dataset of your choice. You may have to do data cleaning, preprocessing, etc. Next, perform the following tasks:
1. Provide a description of the dataset you selected. Split your data into train and test sets with a 90-10 split.
2. Load the model from Hugging Face's Transformers library and write its training script.
3. Fine-tune the pre-trained model with your data and report results on your test set. You must report the BLEU and ROUGE scores. (See the code provided in class for more details.)
4. Analyze your results and discuss the impact of hyperparameters. Are your results impacted by the choice of the LLM here? How?

Part 2: Reinforcement Learning

Task 2 (20 points): We discussed how we can formulate RL problems as an MDP. Describe any real-world application that can be formulated as an MDP. Describe the state space, action space, transition model, and rewards for that problem. You do not need to be precise in the description of the transition model and reward (no formula is needed); a qualitative description is enough.

Task 3 (20 points): RL is used in various sectors; healthcare, recommender systems, and trading are a few of those. Pick one of the three areas. Explain one problem in that domain that can be solved more effectively by reinforcement learning. Find an open-source project (if any) that has addressed this problem, and explain this project in detail.
Task 4 is for the 6000 level ONLY.

Task 4 (100 points): Implement the game of tic-tac-toe using the Q-learning technique: write a class that implements an agent playing tic-tac-toe and learning its Q function (see the resources/links provided in class for more details). Clearly describe your evaluation metric and demonstrate a few runs. You might need to use some online resources to proceed; do not forget to cite those.

Part 3: Recommender Systems

Task 5 (30 points): For this task, use the MovieLens 100k dataset (https://grouplens.org/datasets/movielens/100k/). Perform the necessary data cleaning, EDA, and conversion to a user-item matrix. Implement any 2 collaborative filtering recommendation system (RecSys) algorithms covered in class (matrix factorization, alternating least squares, NCF, etc.) and compare their performance on any 2 evaluation metrics used for RecSys. You may read the literature to find out which evaluation metrics are used for RecSys. Cite all your research.
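For the graduate-only tic-tac-toe task, the heart of tabular Q-learning is a one-line value update. A minimal sketch follows; the class layout, state encoding (board strings), and hyperparameter values are our own illustration, not part of the assignment:

```python
import random
from collections import defaultdict

class QLearningAgent:
    """Tabular Q-learning agent sketch for tic-tac-toe.

    States are board strings like "X.O......"; actions are empty-cell indices.
    """
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state, actions):
        # epsilon-greedy: explore with probability epsilon, else exploit
        if random.random() < self.epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_actions):
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max((self.q[(next_state, a)] for a in next_actions),
                        default=0.0)  # terminal states have no actions
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

A training loop would play self-play episodes, call `update` after each move, and evaluate (the metric the task asks you to define) by, for example, win/draw rate against a random opponent.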

$25.00

[SOLVED] CSCI 4170 Homework 5 (100 points): CNNs, AEs, GANs, attention mechanism

In this project, you will pick an image dataset to solve a classification task. Provide a link to your dataset. You may pick any dataset except MNIST, CIFAR, or ImageNet.

Task 1 (30 points):

Part 1 (10 points): This step involves downloading, preparing, and visualizing your dataset. Create a convolutional base using a common pattern: a stack of Conv and MaxPooling layers. Depending on the problem and the dataset, you must decide what pattern you want to use (i.e., how many Conv layers and how many pooling layers). Please describe why you chose a particular pattern. Add the final dense layer(s). Compile and train the model. Report the final evaluation and describe the metrics.

Part 2 (10 points): The following models are widely used for transfer learning because of their performance and architectural innovations:
1. VGG (e.g., VGG16 or VGG19)
2. GoogLeNet (e.g., InceptionV3)
3. Residual Network (e.g., ResNet50)
4. MobileNet (e.g., MobileNetV2)
Choose any one of the above models to perform the classification task you did in Part 1. Evaluate the results using the same metrics as in Part 1. Are there any differences? Why or why not? Describe in detail.

Part 3 (10 points): Use data augmentation to increase the diversity of your dataset by applying random transformations such as image rotation (you can use any other technique as well). Repeat the process from Part 1 with this augmented data. Did you observe any difference in results? Why or why not?

Task 2 (15 points):

Part 1 (7 points): Variational Autoencoder (VAE): A complete implementation of a VAE in TensorFlow is available at https://www.tensorflow.org/tutorials/generative/cvae (a PyTorch implementation is fine too). Following these steps, try generating images using the same encoder-decoder architecture on a different image dataset (other than MNIST).

Part 2 (8 points): Generative Adversarial Networks (GANs): Repeat Part 1 (use the same dataset) and implement a GAN model to generate high-quality synthetic images.
You may follow the steps outlined here: https://www.tensorflow.org/tutorials/generative/dcgan

Task 3 (55 points): NLP and Attention Mechanism

Part 1 (10 points): Implement the scaled dot-product attention as discussed in class (lecture 14) from scratch (use NumPy and pandas only; no deep learning libraries are allowed for this step).

Part 2 (10 points): Pick any encoder-decoder seq2seq model (as discussed in class) and integrate the scaled dot-product attention into the encoder architecture. You may come up with your own technique of integration or adopt one from the literature. Hint: see the Bahdanau or Luong attention paper presented in class (lecture 14).

Part 3 (5 points): Pick any public dataset of your choice (use a small-scale dataset like a subset of the Tatoeba or Multi30k dataset) for a machine translation task. Train your model from Part 2 for the machine translation task. Evaluate the test set by reporting the BLEU score.

Part 4 (30 points): In this part you are required to implement a simplified Transformer model from scratch (using Python and NumPy/PyTorch/TensorFlow with minimal high-level abstractions) and apply it to a machine translation task (e.g., English-to-French or English-to-German translation) using the same dataset from Part 3. We discussed the Transformer architecture in depth in class (the Vaswani et al. paper, "Attention Is All You Need"). Apply the following simplifications to the original model architecture:
1. Reduced model depth: use 2 encoder layers and 2 decoder layers instead of the standard 6.
2. Limited attention heads: use 2 attention heads in the multi-head attention mechanism rather than 8.
3. Smaller embedding size: set the embedding dimension to 64 instead of 512.
4. Reduced feedforward network size: use a feedforward dimension of 128 instead of 2048.
5. Smaller dataset: use a small dataset (e.g., about 10k sentence pairs).
6. Tokenization simplifications: use a basic subword tokenizer (like Byte Pair Encoding, BPE) or word-level tokenization instead of complex language-specific tokenizers.

Key components to implement:
1. Positional encoding: implement sinusoidal position encoding.
2. Scaled dot-product attention: use the same implementation from Part 1.
3. Multi-head attention: integrate the scaled dot-product attention into a multi-head attention framework using the specified simplifications.
4. Encoder and decoder blocks: implement simplified encoder and decoder layers, ensuring layer normalization, residual connections, and masked attention in the decoder for autoregressive generation.
5. Final output layer: implement a linear layer followed by a softmax activation for generating translated tokens.

Evaluation: Compute the BLEU score on a validation set and compare the performance with your model from Part 2. Explain why there are differences in performance. Also discuss any other differences you notice, for example runtime, etc.

Projects in Machine Learning and AI (RPI Spring 2025)

Project Progress Report (this is not graded): Please submit a report detailing your progress on the final project. This can be a 1-page (maximum 2) Word or PDF description of your data-collection/modeling/preliminary-results tasks. Also describe the next steps towards your final goal.

Task for 6000 level (graduate level only): 100 points. Medical image segmentation is an important problem in the healthcare domain. Polyp recognition and segmentation is one field which helps doctors identify polyps in colonoscopy images. The CVC-Clinic database consists of frames extracted from colonoscopy videos. The dataset contains several examples of polyp frames and corresponding ground truth for them. The ground truth image consists of a mask corresponding to the region covered by the polyp in the image.
The data is available in both .png and .tiff formats here: https://polyp.grandchallenge.org/CVCClinicDB/. Consider this task a minor research project in which you should research the existing models used (https://paperswithcode.com/dataset/cvc-clinicdb) to identify polyps in these images. Report on the key findings and the evaluation metrics used for this problem. Variants of the U-Net architecture are often used to solve it. Implement either U-Net or any of its variants (U-Net++, ResUNet, etc.) to segment the polyp images. This may be a computation-intensive task (requiring GPUs); in case you do not have access to GPUs, simply reduce your training data size. Report your results, and compare and contrast them with the results of at least 2 other research papers.
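Task 3, Part 1 (scaled dot-product attention with NumPy only) fits in a few lines. A hedged sketch; the function name, the boolean masking convention, and the 2-D (unbatched) shapes are our choices:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention, NumPy only.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    Returns (output, weights) with output of shape (n_q, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity scores, (n_q, n_k)
    if mask is not None:
        # positions where mask is False receive a large negative score
        scores = np.where(mask, scores, -1e9)
    # numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

For Part 4, the same function would be called once per head inside the multi-head wrapper, with a lower-triangular mask in the decoder for autoregressive generation.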

$25.00

[SOLVED] CSCI 4170 Homework 4 (100 points): Sequence models

A Recurrent Neural Network (RNN) is a neural network that can be used when your data is treated as a sequence, where the order of the data points matters. You will use an existing RNN in Part 1, then implement an RNN in Part 2. In Part 3, you will demonstrate the usage of any of the word embeddings we discussed in class. Upload a .txt file with a link to your file as your submission on Submitty (you may have different links for each task). You need to perform the following tasks for this homework (Task 1 is for the Graduate (6000) level ONLY).

6000 level ONLY – Task 1 (50 points): This task involves training existing models. Download the character-level RNN at https://github.com/karpathy/char-rnn. You are required to read the documentation provided in this repository and experiment with the RNN model. This is a legacy repository; therefore, one task would be to research and use a recent version. Train the model on the 'tiny Shakespeare' dataset available at the same location. Create outputs of the model after training for (i) 5 epochs, (ii) 50 epochs, and (iii) 500 epochs. What significant differences do you observe between the 3 outputs? Explain. Repeat the experiment with the LSTM model provided in the repository. Explain the differences and/or similarities between the results of both models.

Task 2 (50 points): In this task, you will pick a dataset (time series or any other form of sequential data) and an associated problem that can be solved via sequence models. You must describe why you need sequence models to solve this problem. Include a link to the dataset source. Next, pick an RNN framework that you will use to solve this problem (this framework can be in TensorFlow, PyTorch, or any other Python package).

Part 1 (10 points): Implement your RNN either using an existing framework OR by implementing your own RNN cell structure.
In either case, describe the structure of your RNN and the activation functions you are using for each time step and in the output layer. Define a metric you will use to measure the performance of your model (NOTE: performance should be measured for both the validation set and the test set).

Part 2 (30 points): Update your network from Part 1 with first an LSTM and then a GRU-based cell structure (you can treat these as 2 separate implementations). Redo the training and performance evaluation. What are the major differences you notice? Why do you think those differences exist between the 3 implementations (basic RNN, LSTM, and GRU)? Note: In Parts 1 and 2, you must perform sufficient data visualization, preprocessing, and/or feature engineering if needed. An overall visualization of the loss function should also be provided.

Part 3 (10 points): Can you use a traditional feed-forward network to solve the same problem? Why or why not? (Hint: can time-series data be converted to the usual features that can be used as input to a feed-forward network?)

Task 3 (50 points):

Part 1: Implementing Word Embeddings (10 points)
• Use a pre-trained word embedding model (Word2Vec, GloVe, FastText, or BERT embeddings).
• Provide a comparative discussion on why you chose this embedding over others.
• Load embeddings efficiently (either from pre-trained vectors or using an NLP library like Gensim, spaCy, or Hugging Face).
• Allow dynamic user input of two words and output their respective embeddings.
• Handle cases where a word is out of vocabulary (OOV) and suggest ways to approximate its embedding.

Part 2: Cosine Similarity Computation (20 points)
• Implement a function that computes the cosine similarity between two word embeddings.
• Explain why cosine similarity is useful in word embedding space.
• Allow batch processing, where users can input multiple word pairs for simultaneous similarity computation.
• Visualization requirement: create a 2D or 3D scatter plot (e.g., using PCA or t-SNE) to visually show how similar and dissimilar words cluster together in the embedding space.

Part 3: Designing a Novel Dissimilarity Metric (20 points)
• Define a custom dissimilarity score that goes beyond cosine similarity. Possible approaches include:
  o Euclidean distance (how far apart words are in vector space).
  o Word entropy-based dissimilarity (how uncommon two words are relative to each other in corpora).
  o Semantic contrast measure (using external knowledge bases like WordNet).
• Either design your own metric or cite an existing one from the literature (provide a proper reference). Explain why your metric captures novelty/diversity better than cosine similarity alone.
• Allow users to toggle between different similarity/dissimilarity measures via function parameters.
• Visualization requirement:
  o Plot the ranking of words based on their similarity/dissimilarity to a given word (e.g., how words like "cat" rank against "dog," "lion," and "table" using different metrics).
  o Use a heatmap to demonstrate and compare similarity and dissimilarity across multiple (any number of your choice) word pairs.
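The cosine similarity computation in Task 3, Part 2 is a small amount of NumPy. A sketch, with function names and the pair-list batch interface being our own illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: cos(u, v) = u.v / (|u||v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def batch_cosine_similarity(pairs):
    """Similarities for a list of (vector, vector) pairs, for batch processing."""
    return [cosine_similarity(u, v) for u, v in pairs]
```

Cosine similarity is useful here because it compares vector directions and ignores magnitude, which in embedding spaces varies with word frequency more than with meaning.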

$25.00

[SOLVED] CSCI 4170 Homework 3 (100 points): Deep learning

Dataset selection: In this project, you will pick a dataset (not the same as in the previous homeworks) and describe the problem you would like to solve (classification or regression). Include a link to the dataset source. It is highly recommended that you pick a dataset with at least 10,000 observations. There are many ways of describing a big dataset; one way is that a big dataset is more complex, where complexity can refer to the number of observations, the number of features, or the type of data. For this project, there is no restriction on the number of features your dataset has; however, having more features gives you greater ability to apply the techniques discussed in class.

Part 1 (50 points): In this part you will implement a neural network from scratch. You cannot use any existing deep learning framework, but you may use the NumPy and pandas libraries for efficient calculations. Refer to the Lecture 5 slides for details on the computations required. Write a class called NeuralNetwork that has at least the following methods (you are free to add your own methods too):
a. An initialization method.
b. A forward propagation method that performs the forward propagation calculations.
c. A backward propagation method that implements the backpropagation algorithm discussed in class.
d. A train method that includes the code for gradient descent.
e. A cost method that calculates the loss function.
f. A predict method that calculates the predictions for the test set.
Test your NeuralNetwork class with the dataset you selected. If the dataset is big, you may notice inefficiencies in runtime; try incorporating different variants of gradient descent (mini-batch, stochastic, etc.) to improve that. You may choose to use only a subset of your data for this task (or any other technique). Explain which technique you followed and why.
Part 2 (50 points): In this part you will implement a 2-layer neural network using any deep learning framework (e.g., TensorFlow, PyTorch, etc.). You should pick the deep learning framework that you would like to use for your 2-layer neural network.

Task 1 (5 points): Assuming you are not familiar with the framework, in this part of the homework you will present your research describing the resources you used to learn it (you must include links to all resources). Clearly explain why you needed a particular resource for implementing a 2-layer neural network (NN). (Consider how you will keep track of all the computations in a NN, i.e., what libraries/tools you need within this framework.) For example, some well-known resources for TensorFlow and PyTorch are:
https://www.tensorflow.org/guide/autodiff
https://www.tensorflow.org/api_docs/python/tf/GradientTape
https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html
Hint: You need to figure out the APIs/packages used to implement forward propagation and backward propagation.

Task 2 (35 points): Once you have figured out the resources you need, you should design and implement your project. The project must include the following steps (it is not limited to these steps):
1. Exploratory data analysis (can include data cleaning, visualization, etc.).
2. Perform a train-dev-test split.
3. Implement forward propagation (clearly describe the activation functions and other hyperparameters you are using).
4. Compute the final cost function.
5. Implement gradient descent (any variant appropriate to your data and project) to train your model. In this step it is up to you, as someone in charge of their own project, to improvise using optimization algorithms (Adam, RMSProp, etc.) and/or regularization. Experiment with normalized inputs, i.e., comment on how your model performs when the inputs are normalized.
6. Present the results using the test set.
NOTE: Once you have implemented your 2-layer network, you may increase and/or decrease the number of layers as part of the hyperparameter tuning process.

Task 3 (10 points): For Task 2, describe how you selected the hyperparameters. What was the rationale behind the technique you used? Did you use regularization? Why or why not? Did you use an optimization algorithm? Why or why not?

The following task is for the Graduate level only (6000 level):

Task 4 (100 points): Create another baseline model (this can be any model we have covered so far except a deep learning model). Using the same training data as above, train your model and evaluate the results using the test set. Compare the results of both models (the neural network and the baseline model). What are the reasons for one model performing better (or not) than the other? Explain.
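A minimal sketch of the class interface Part 1 asks for, assuming ReLU and sigmoid activations, binary cross-entropy loss, and full-batch gradient descent; these choices, the layer sizes, and all names except the required class/method names are our own:

```python
import numpy as np

class NeuralNetwork:
    """Minimal 2-layer network for binary classification, NumPy only (sketch)."""

    def __init__(self, n_in, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, X):
        self.Z1 = X @ self.W1 + self.b1
        self.A1 = np.maximum(self.Z1, 0)            # ReLU hidden layer
        self.Z2 = self.A1 @ self.W2 + self.b2
        self.A2 = 1 / (1 + np.exp(-self.Z2))        # sigmoid output
        return self.A2

    def cost(self, Y_hat, Y):
        eps = 1e-12                                  # guard against log(0)
        return -np.mean(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))

    def backward(self, X, Y):
        m = X.shape[0]
        dZ2 = (self.A2 - Y) / m                      # sigmoid + cross-entropy gradient
        dW2 = self.A1.T @ dZ2
        db2 = dZ2.sum(axis=0)
        dZ1 = (dZ2 @ self.W2.T) * (self.Z1 > 0)      # ReLU derivative
        dW1 = X.T @ dZ1
        db1 = dZ1.sum(axis=0)
        self.W1 -= self.lr * dW1; self.b1 -= self.lr * db1
        self.W2 -= self.lr * dW2; self.b2 -= self.lr * db2

    def train(self, X, Y, epochs=500):
        for _ in range(epochs):                      # full-batch gradient descent
            self.forward(X)
            self.backward(X, Y)
        return self.cost(self.forward(X), Y)

    def predict(self, X):
        return (self.forward(X) >= 0.5).astype(int)
```

Swapping `train` to iterate over shuffled mini-batches gives the stochastic/mini-batch variants the assignment mentions for large datasets.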

$25.00

[SOLVED] CSCI 4170 Homework 2 (100 points): Ensemble learning

Task 1 (30 points): Implement a decision tree classifier for your classification problem. You may use a built-in package to implement your classifier. Additionally, do the following:
• Visualize the decision tree structure for at least three different parameter settings. Comment on how the depth and complexity change the tree.
• Do some research on what sensitivity analysis is and how it is performed (include citations). Perform a sensitivity analysis to measure the impact of at least two input features on your model's decision boundary.

Task 2 (30 points): From the bagging and boosting ensemble methods, pick one algorithm from each category. Implement both algorithms using the same data.
• Use stratified k-fold cross-validation with at least three different fold counts (e.g., 5, 10, 15). You may do your own research on this technique (include citations).
• Evaluate the models using any three evaluation metrics of your choice (e.g., accuracy, precision, F1-score, etc.).
• Comment on the behavior of each algorithm under the metrics. Does the performance ranking change based on the metric used? Why?

Task 3 (40 points): Compare the effectiveness of the three models implemented above. Analyze the results using the following:
• A confusion matrix for one selected test fold.
• A statistical test (e.g., a paired t-test) to determine whether the differences between models are significant.
• A discussion of the trade-off between bias and variance for each model.

The following task is for the Graduate level only (6000 level). This task is more open-ended and emphasizes the research aspect of implementing a model. You will explore the impact of hyperparameter tuning, which we haven't discussed in detail so far.

Task (50 points): For the same classification problem solved above, implement the XGBoost algorithm. If you picked XGBoost as one of the boosting algorithms in Task 2, you may reuse that implementation. Implement and evaluate XGBoost with the following requirements:
1. Perform a grid search or random search over at least 3 hyperparameters, such as learning rate, max depth, and subsample.
2. Analyze the sensitivity of your model to changes in these parameters.
3. Optional (no points taken off if not done): create plots to show the effect of each parameter on accuracy and one other metric.

Note: An experiment can be defined as a systematic way of picking parameter values. This could be something that you come up with yourself, or you may refer to the existing literature on design of experiments for hyperparameter tuning. This task will require you to do some research into this open-source library and hyperparameter tuning yourself. A good place to start:
https://www.jeremyjordan.me/hyperparameter-tuning/
https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
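The stratified k-fold splitting that Task 2 relies on can be done from scratch in a few lines; in practice `sklearn.model_selection.StratifiedKFold` does this for you. A sketch (function name and interface are ours):

```python
import numpy as np

def stratified_kfold_indices(y, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds preserve class proportions.

    From-scratch illustration of stratified k-fold: each class's indices are
    shuffled and dealt round-robin into k folds.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    all_idx = set(range(len(y)))
    for fold in folds:
        test = sorted(fold)
        train = sorted(all_idx - set(test))
        yield np.array(train), np.array(test)
```

Each model would then be fitted on `train` and scored on `test` for every fold, giving per-fold metric arrays that feed the paired t-test in Task 3.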

$25.00

[SOLVED] CSCI 4170 Homework 1 (100 points): Logistic regression implementation

Task 1 (20 points): Advanced Objective Function and Use Case
1. Derive the objective function for logistic regression using Maximum Likelihood Estimation (MLE). Do some research on the MAP technique for logistic regression and include your findings on how this technique differs from MLE (include citations).
2. Define a machine learning problem you wish to solve using logistic regression. Justify why logistic regression is the best choice and compare it briefly to another linear classification model (cite your work if this other technique was not covered in class).
3. Discuss how your dataset corresponds to the variables in your equations, highlighting any assumptions in your derivation from part 1.

Task 2 (20 points): Dataset and Advanced EDA
1. Select a publicly available dataset (excluding commonly used datasets such as Titanic, Housing Prices, or Iris). Provide a link to your dataset. Ensure the dataset has at least 10 features to allow for more complex analysis.
2. Perform Exploratory Data Analysis (EDA), addressing potential multicollinearity among features. Use the Variance Inflation Factor (VIF) to identify highly correlated variables and demonstrate steps to handle them.
3. Visualize the dataset's feature relationships, including at least two advanced visualization techniques (e.g., pair plots with KDE, heatmaps with clustering).

Task 3 (20 points): Logistic Regression Implementation
1. Implement logistic regression from scratch, including a vectorized implementation of the cost function and gradient descent.
2. Implement and compare the three gradient descent variants (batch gradient descent, stochastic gradient descent, and mini-batch gradient descent). Explain their convergence properties with respect to your cost function. (Refer to the research paper discussed in class; you may add additional research too.)

Task 4 (40 points): Optimization Techniques and Advanced Comparison
1. Implement or use packages to incorporate any three optimization algorithms (e.g., Momentum, RMSProp, Adam). Compare their performance with the vanilla stochastic gradient descent implementation from Task 3.
2. Define and use multiple evaluation metrics (e.g., precision, recall, F1 score) to analyze and interpret the results for each algorithm.
3. Perform a hyperparameter tuning process (manual or automated using grid search/random search) for each optimization algorithm and assess its impact on performance. If you have to do some research for these techniques, please cite your sources.
4. Conclude by discussing the practical trade-offs of the algorithms, including computational complexity, interpretability, and suitability for large-scale datasets. (For more on evaluation metrics, check this link: https://www.kdnuggets.com/2020/05/modelevaluation-metrics-machine-learning.html)

Research task (not graded): After finishing all the tasks, try to think about any novel ways of optimization that you can come up with. Can you improve or update RMSProp and/or Adam? Can you make some minor adjustments to the momentum update equation? If yes, then you should definitely try experimenting with your new technique. If it gives improved results in a particular scenario, then, believe it or not, you have invented something of your own and you are ready to publish! Keep thinking.

Note: Grading will focus on your understanding of the problem and the solution. Please make sure you explain everything you have implemented in your Jupyter Notebook. You must explain your results; e.g., if the algorithm you implemented has a lower accuracy, you should comment on some of the reasons behind that result.
• Focus on demonstrating a deeper understanding of logistic regression concepts and their applications.
• For full credit, clearly explain every step and decision, providing detailed justifications in your Jupyter Notebook.
• Discuss any unexpected outcomes in your results and hypothesize reasons for such behaviors.
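The vectorized cost and batch gradient descent of Task 3 can be sketched compactly; function names and hyperparameter defaults below are our own illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_cost(X, y, w, b):
    """Vectorized negative log-likelihood (binary cross-entropy)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12                                    # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Batch gradient descent on the MLE objective."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        err = sigmoid(X @ w + b) - y               # residuals drive the gradient
        w -= lr * (X.T @ err) / n
        b -= lr * err.mean()
    return w, b
```

The stochastic and mini-batch variants differ only in computing `err` on one example or a sampled batch per step instead of the full matrix.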

$25.00

[SOLVED] COMP 251: Assignment 1, Exercise 1 (60 points): Building a hash table

Exercise 1 (60 points). Building a Hash Table

We want to compare the performance of hash tables implemented using chaining and open addressing. In this assignment, we will consider hash tables implemented using the multiplication and linear probing methods. Note that the multiplication method described here is slightly different from the one seen in class, but the principle remains the same. We will call the hash functions h and g (respectively) and describe them below. Note that we are using the hash function h to define g.

Collisions solved by chaining (multiplication method): h(k) = ((A · k) mod 2^w) >> (w − r)

Open addressing (linear probing): g(k, i) = (h(k) + i) mod 2^r

In the formulas above, r and w are two integers such that w > r, and A is a random number such that 2^(w−1) < A < 2^w. In addition, let n be the number of keys inserted and m the number of slots in the hash tables. Here, we set m = 2^r and r = ⌈w/2⌉. The load factor α is equal to n/m.

We want to estimate the number of collisions when inserting keys, with respect to the keys and the choice of values for A. We provide you a set of two template files within COMP251HW1.zip that you will complete. These contain two classes, one for each hash function, with several helper functions, namely generateRandom, which enables you to generate a random number within a specified range. Please read the provided code describing the hash table classes with attention.

Your first task is to complete the two Java methods Open_Addressing.probe and Chaining.chain. These methods must implement the hash functions for (respectively) the linear probing and multiplication methods. They take as input a key k, as well as an integer 0 ≤ i < m for the linear probing method, and return a hash value in [0, m).
Next, you will implement the method insertKey in both classes, which inserts a key k into the hash table and returns the number of collisions encountered before insertion (or the number of collisions encountered before giving up on inserting, if applicable). Note that for this exercise, we define the number of collisions in open addressing as the number of keys encountered, or "jumped over", before inserting or removing a key (note that this definition only makes sense if the key is in the hash table). For chaining, we simply count the number of other keys in the same bin at the time of insertion. You can assume the key is not negative, and that we will not attempt to insert a key that already exists in the hash table.

You will also implement a method removeKey, this one only in Open_Addressing. This method takes as input a key k and removes it from the hash table while visiting the minimum number of slots possible. Like insertKey, it should output the number of collisions if the key is found. If the key is not in the hash table, the method should leave the hash table unchanged and output the number of slots visited before giving up. You will notice from the code and comments that empty slots are given a value of −1. If applicable, you are allowed to use a different notation of your choice for slots containing a deleted element.

Make sure to test your assignment thoroughly by thinking about all the different situations that can occur when dealing with hash tables. Build your own hash table and try inserting and removing keys!

For this question, you will need to submit your Chaining.java and Open_Addressing.java source files to the Assignment 1 => Q1 – Hash lesson in Ed-Lessons. You will not be tested on execution time for this question, but you will be tested on the efficiency of your program in terms of the number of steps. You must implement your own hash table.
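The assignment itself is in Java, but the arithmetic of the two hash functions and the probing loop can be sketched in Python to check your understanding (function names and the parameter values in the test below are our own, chosen to satisfy w > r and 2^(w−1) < A < 2^w):

```python
def h(k, A, w, r):
    """Multiplication-method hash: ((A * k) mod 2^w) >> (w - r)."""
    return ((A * k) % (1 << w)) >> (w - r)

def g(k, i, A, w, r):
    """Linear probing: (h(k) + i) mod 2^r."""
    return (h(k, A, w, r) + i) % (1 << r)

def insert_linear_probing(table, k, A, w, r):
    """Insert k into an open-addressing table (-1 marks an empty slot).

    Returns the number of occupied slots jumped over before insertion,
    or m if the table is full and we give up.
    """
    m = 1 << r
    for i in range(m):
        slot = g(k, i, A, w, r)
        if table[slot] == -1:
            table[slot] = k
            return i          # i occupied slots were jumped over
    return m                  # gave up: every probe hit an occupied slot
</```

A removeKey sketch would probe the same sequence g(k, 0), g(k, 1), ... and stop at the key or at the first truly empty slot, leaving a distinct "deleted" marker so later probes keep walking.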
Using the built-in hash table from Java will result in a 0 on this question. Exercise 2 (50 points). Building a Disjoint Set We want to implement a disjoint set data structure with union and find operations. The template for this program is available on the course website and named DisjointSets.java. In this question, we model a partition of n elements with distinct integers ranging from 0 to n − 1 (i.e. each element is represented by an integer in [0, · · · , n − 1], and each integer in [0, · · · , n − 1] represent one element). We choose to represent the disjoint sets with trees, and to implement the forest of trees with an array named par. More precisely, the value stored in par[i] is parent of the element i, and par[i]==i when i is the root of the tree and thus the representative of the disjoint set. You will implement union by rank and the path compression technique seen in class. The rank is an integer associated with each node. Initially (i.e. when the set contains one single object) its value is 0. Union operations link the root of the tree with smaller rank to the root of the tree with larger rank. In the case where the rank of both trees is the same, the rank of the new root increases by 1. You can implement the rank with a specific array (called rank) that has been added to the template, or use the array par (this is tricky). Note that path compression does not change the rank of a node. Download the file DisjointSets.java, and complete the methods find(int i) as well as union(int i, int j). The constructor takes one argument n (a strictly positive integer) that indicates the number of elements in the partition, and initializes it by assigning a separate set to each element. The method find(int i) will return the representative of the disjoint set that contains i (do not forget to implement path compression here!). The method union(int i, int j) will merge the set with smaller rank (for instance i) in the disjoint set with larger rank (in that case j). 
In that case, the root of the tree containing i will become a child of the root of the tree containing j, and the method returns the representative (as an integer) of the new merged set. Do not forget to update the ranks. In the case where the ranks are identical, you will merge i into j.

Once completed, compile and run the file DisjointSets.java. It should produce the output available in the file unionfind.txt available on MyCourses.

Exercise 3 (90 points). Improving our discussion board

The teaching staff in Comp251 is really happy with how our discussion board (Ed) is working; however, we believe there is one function missing. This function will allow us to identify important topics (discussed in Ed) by filtering key words. In particular, given a list of messages posted in Ed, we want a function that reports the words used by every single user on the discussion board. This list must be sorted from most to least used word (i.e., the word with the highest frequency must be first). In case of a frequency tie, the words must be sorted in alphabetical order.

Let's now look at some features of the discussion-board posts. The list of posts will be provided to you as an array of strings (String[]), where every slot in the array contains one message. All messages have the following characteristics:
• Each message is represented in Java as a String.
• Each message begins with a user's name of no more than 20 characters.
• After the name, each message continues with the content of that user's post, all in lower case.
• The total number of characters across all messages, including spaces, will not exceed 2 × 10^6.

Let's now look at two examples to make sure that everything is clear.
Given the following list of posts:

David no no no no nobody never
Jennifer why ever not
Parham no not never nobody
Shishir no never know nobody
Alvin why no nobody
Alvin nobody never know why nobody
David never no nobody
Jennifer never never nobody no

your algorithm must return the array [no, nobody, never] (exactly in that order). Those three words were used by every single user of our discussion board, and they are reported in order of frequency (i.e., "no" is the most frequently used word). In case of a tie, the order is decided lexicographically.

Now, if the following list of posts is given to you:

David comp
Maria music

your algorithm must return an empty array [], since no word was used by every single user.

For this question, you must implement your solution in the function Discussion_Board(String[] posts), which is inside the class/file A1_Q3.java. Please note that for this question both the correctness and the efficiency of your algorithm will be tested, so it is in your interest to code your solution using the right algorithms and data structures. Please note that for this question it is forbidden to create new classes.

Exercise 4 (0 points). Least common multiple

This optional problem is intended to prepare you for the midterm. We will provide solutions, but you will not receive marks for completing it. Here, we aim to study an algorithm that computes, for an integer n ∈ N, the least common multiple (LCM) of the integers ≤ n. For a given integer n ∈ N, let P_n = p1^{x1} · p2^{x2} · · · pk^{xk}, where p1, p2, · · · , pk is the strictly increasing sequence of prime numbers between 2 and n, and for each i ∈ {1, · · · , k}, xi is the integer such that pi^{xi} ≤ n < pi^{xi+1}. For example, P_9 = 2^3 · 3^2 · 5 · 7. More precisely, we are going to compute all P_j, j ∈ {1, · · · , n}, and store pairs of integers (p^α, p) in a heap, a binary tree where the element stored in a parent node is strictly smaller than those stored in its children nodes.
For two given pairs of integers (a, b) and (a′, b′), we have (a, b) < (a′, b′) if and only if a < a′. Let h denote the tree height; we admit that h = Θ(log n). All levels of the binary tree are filled with data except possibly level h, where elements are stored from left to right.

After computing P_j, all pairs (p^α, p) are stored in the heap such that p is a prime number smaller than or equal to j and α is the smallest integer such that j < p^α. For instance, after computing P_9, we store (16, 2), (27, 3), (25, 5), and (49, 7) in the heap.

The algorithm is iterative. We store in the variable LCM the least common multiple computed so far. At first, LCM = 2 is the LCM of the integers smaller than or equal to 2, and the heap is constructed with only one node with value (4, 2). After finishing the (j − 1)-th step, we compute the j-th step as follows:
1. If j is a prime number, multiply LCM by j and insert a new node (j^2, j) in the heap.
2. Otherwise, if the root (p^α, p) satisfies j = p^α, then we multiply LCM by p, change the root's value to (p^{α+1}, p), and reconstruct the heap.

We are going to prove, step by step, that the time complexity of this algorithm is O(n√n).

3.1 – 0 points. In operation 1, a new node is inserted. What is the complexity of this operation?
3.2 – 0 points. In operation 2, the heap is reconstructed. What is the time complexity of this operation?
3.3 – 0 points. The number of prime numbers smaller than n involved in operation 2 is less than √n. Prove that the number of times N we need to execute operation 2 to compute P_n is asymptotically negligible compared to n. Tip: you can prove this by showing that N is o(n), where o (little o) denotes a strict upper bound.
3.4 – 0 points. Assume the complexity of testing whether an integer is prime is √n, and suppose multiplication has time complexity 1. Prove that the algorithm's complexity is O(n√n).
3.5 – 0 points. Prove that, for a given heap of height h with n nodes, we have h = Θ(log n).
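To sanity-check the two operations above on small inputs, here is a hedged sketch of the algorithm in Java. It substitutes java.util.PriorityQueue for the hand-built heap the exercise studies, so it illustrates the arithmetic only, not the heap implementation:

```java
// Hedged sketch of the LCM algorithm from Exercise 4, with java.util.PriorityQueue
// standing in for the hand-built heap. Pairs (p^a, p) are ordered by first component.
import java.util.PriorityQueue;

class LcmSketch {
    static boolean isPrime(int j) {               // O(sqrt n) trial division
        if (j < 2) return false;
        for (int d = 2; (long) d * d <= j; d++)
            if (j % d == 0) return false;
        return true;
    }

    // Least common multiple of 1..n (fits in a long only for small n).
    static long lcmUpTo(int n) {
        PriorityQueue<long[]> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        long lcm = 1;
        for (int j = 2; j <= n; j++) {
            if (isPrime(j)) {                     // operation 1: new prime j
                lcm *= j;
                heap.add(new long[]{(long) j * j, j});
            } else if (!heap.isEmpty() && heap.peek()[0] == j) {
                long[] root = heap.poll();        // operation 2: j = p^a at the root
                lcm *= root[1];
                heap.add(new long[]{root[0] * root[1], root[1]});
            }
        }
        return lcm;
    }
}
```

For n = 9 this yields 2520 = 2^3 · 3^2 · 5 · 7, matching P_9 from the text.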
No partial credit will be awarded.

What to submit? Attached to this assignment are Java template files. You have to submit only these Java files. Please DO NOT zip (or rar) your files, and do not submit any other files.

Where to submit? You need to submit your assignment in Ed Lessons. Please review tutorial 2 if you still have questions about how to do that (or attend office hours). Please note that you do not need to submit anything to myCourses.

When to submit? Please do not wait until the last minute to submit your assignment. You never know what could go wrong at the last moment. Please also remember that you are allowed multiple submissions. So, submit your partial work early and you will be able to upload updated versions later (as long as they are submitted before the deadline).

How will this assignment be graded? Each student will receive an overall score for this assignment. This score is the combination of the passed open and private test cases for the three questions of this assignment. The open cases correspond to the examples given in this document plus other examples. These cases will be run against your submissions and you will receive automated test results (i.e., the autograder output) for them. You MUST guarantee that your code passes these cases. In general, the private test cases are inputs that you have not seen; they will test the correctness of your algorithm on those inputs once the deadline of the assignment is over. However, for this assignment you will have information about the status (i.e., passed or not) of your tests. Please note that not all test cases have the same weight.

Student Code of Conduct – Assignment Checklist

The instructor provides this checklist with each assignment. The instructor checks the boxes next to items that will be permitted in this assignment. If an item is not checked or not present in the list, then that item is not allowed. The instructor may edit this list for their course.
A student cannot assume they can do something if it is not listed in this checklist; it is the responsibility of the student to ask the professor (not the TA).

Instructor's checklist of permitted student activities for an assignment:

Understanding the assignment:
Read the assignment with your classmates
Discuss the meaning of the assignment with your classmates
Consult the notes, slides, textbook, and the links to websites provided by the professor(s) and TA(s) with your classmates (do not visit other websites)
Use flowcharts when discussing the assignment with classmates
Ask the professor(s) and TA(s) for clarification on assignment meaning and coding ideas
Discuss solutions using code
Discuss solutions using pseudo-code
Discuss solutions using diagrams
Discuss the meaning of the assignment with tutors and other people outside of the course
Look for partial solutions in public repositories

Doing the assignment:
● Writing
Write the solution code on your own
Write your name at the top of every source file with the date
Provide references to copied code as comments in the source code (e.g. teacher's notes). Please note that you are not allowed to copy code from the internet. Copied code is not permitted at all, even with references
Permitted to store partial solutions in a public repository
● Debugging
Debug the code on your own
Debug code with the professor
Debug code with the TA
Debug code with the help desk
Debug code with the Internet. Please note that this is allowed only to debug syntax errors, not logic errors
Debug code with a classmate
Debug code with a tutor or other people outside of the course
● Validation
Share test cases with your classmates
● Internet
Visit Stack Overflow (or similar). Please note that this is allowed only to debug syntax errors, not logic errors.
Visit Chegg (or similar)
● Collaboration
Show your code to classmates
Share partial solutions with other people in the class
Post code screenshots on the course discussion board
Show code to the help desk

Submitting and cleaning up after the assignment:
Back up your code to a public repository/service like GitHub without the express written permission of the professor (this is not plagiarism, but it may not be permitted)
Let people peek at your files
Share your files with anyone
ZIP your files and upload to the submission box
Treat your work as private
Make public the solutions to an assignment
Discuss solutions in a public forum after the assignment is completed


[SOLVED] Assignment 3 COMP 250: Decision trees

In this assignment you will be learning about decision trees and how to use them to solve classification problems. Working on this problem will allow you to better understand how to manipulate trees and how to use recursion to exploit their recursive structure.

1 Introduction

Congratulations! You have just landed an internship at a startup software company. This company is trying to use AI techniques – in particular decision trees – to analyze spatial data. Your first task in this internship is to write a basic decision tree class in Java. This will demonstrate to your new employer that you understand what decision trees are and how they work.

From a quick web search, you learn that decision trees are a classical AI technique for classifying objects by their properties. One typically speaks of object attributes rather than object properties, and one typically speaks of object labels to say how an object is classified. As a concrete example, consider a computer vision system that analyzes surveillance video in a large store and classifies people seen in the video as being either employees or customers. An example attribute could be the location of the person in the store. Employees tend to spend their time in different places than customers. For example, only employees are supposed to be behind the cash register.

For classification problems in general, one denotes object attributes by x variables and the object label by a y variable. In the example that you will work with in this assignment, the attributes will be the spatial position (x1, x2), and the label y will be a color (red or green).

Let's get back to decision trees themselves. Decision trees are rooted trees, so they have internal nodes and external nodes (leaves). To classify a data item (datum) using a decision tree, one starts at the root and follows a path to a leaf. Each internal node contains an attribute test.
This test amounts to a question about the value of the attribute – for example, the location of a customer in the store. Each child of an internal node in a decision tree represents an outcome of the attribute test. For simplicity, you will only have to deal with binary decision trees, so the answers to attribute-test questions will be either true or false. A test might be x1 < 5. The answer determines which child node to follow on the path to a leaf.

The labelling of the object occurs when the path reaches a leaf node. Each leaf node contains a label that is assigned to any test data object that arrives at that leaf node after traversing the tree from the root. The label might be red or green, which could be coded using an enum type, or simply 0 or 1. Note that, for any test data object, the label given is the label of the leaf node reached by that object, which depends on the outcomes of the attribute tests at the internal nodes.

The reason that this document is longer than usual is that decision trees were not covered in the pre-recorded lectures. This document should give you enough information about decision trees for you to do the assignment. If you wish to learn more about decision trees, there are ample resources available on the web. Steer towards resources that are about decision trees in computer science, in particular in machine learning or data mining. For example: https://en.wikipedia.org/wiki/Decision_tree_learning

Be aware that these resources will contain more information than you need to do this assignment, so you would need to sift through it and figure out what is important and what can be ignored. Feel free to use Piazza to share links to good resources and to resolve questions you might have. The task of understanding what decision trees are is part of the assignment. The amount of coding you need to do is relatively small, once you figure out what needs to be done.
1.1 Creating decision trees

To classify objects using a decision tree, we first need to have a decision tree! Where do decision trees come from? In machine learning, one creates decision trees from a labelled data set. Each data item (datum) in the given labelled data set has well-defined attributes x and label y. We refer to the data set that is used to create a decision tree as the training set.

The basic algorithm for creating a decision tree from a training set is as follows. This is the algorithm that you will need to implement for fillDTNode() later.

Data: data set (training)
Result: the root node of a decision tree

MAKE_DECISION_TREE_NODE(data)
  if the labelled data set has at least k data items (see below) then
    if all the data items have the same label then
      create a leaf node with that class label and return it;
    else
      create a "best" attribute test question; (see details later)
      create a new node and store the attribute test in that node, namely the attribute and threshold;
      split the set of data items into two subsets, data1 and data2, according to the answers to the test question;
      newNode.child1 = MAKE_DECISION_TREE_NODE(data1)
      newNode.child2 = MAKE_DECISION_TREE_NODE(data2)
      return newNode
    end
  else
    create a leaf node with label equal to the majority of labels and return it;
  end

In the program, k is an argument of the decision-tree constructor, minSizeDatalist.

1.2 Classification using decision trees

Once you have a decision tree, you can use it to classify new objects. This is called the testing phase. For the testing phase, one can use data items from the original data used for training (above) or one can use new data. Typically, when a decision tree is used in practice, the test objects are unlabelled. In the surveillance example earlier, the system would examine a new video and try to classify people as employees versus customers. Here the idea is that one does not know the correct class for each person.
Let's consider this general scenario now: we are given a decision tree and the attributes of some new unlabelled test object. We will use the decision tree to choose a label for the object. This is done by traversing the decision tree from the root to a leaf, as follows:

Data: A decision tree, and an unlabelled data item (datum) to be classified
Result: (Predicted) classification label

CLASSIFY(node, datum) {
  if node is a leaf then
    return the label of that node, i.e. classify;
  else
    test the data item using the question stored at that (internal) node, and determine which child node to go to, based on the answer;
    return CLASSIFY(child, datum);
  end
}

2 Instantiating the decision tree problem

For this assignment, the problem is to classify points based on their position. Each datapoint has an array of attributes x and a binary label y (0 or 1). For this section, we will focus on datapoints with only two attributes. A graphical representation of an example data set looks like this. (In the graphs, the attribute value x[0] is represented as x1 and x[1] as x2.) For those who print the document in color, the red symbols can be label 0 and the green symbols can be label 1. For those printing in black and white, the (red) disks are label 0 and the (green) ×'s are label 1.

Figure 1: x1 is the horizontal and x2 the vertical coordinate. Note that the points are intermixed. There is no way to draw a horizontal or vertical line – or any curve, for that matter – that could split the data.

2.1 Finding a good split

Now that we have an idea of what the data are, let us return to the question of how to split the data into two sets when creating a node in a decision tree. What makes a 'good' split?
Intuitively, a split is good when the labels in each set are as 'pure' as possible, that is, each subset is dominated as much as possible by a single label (and the dominant label differs between subsets). For example, suppose this is our data:

Figure 2: What would be a good split of this data?

Two of many possible splits we could make are shown in Fig. 3. Fig. 3-a splits the data into two sets based on the test condition (x1 < 4), i.e. true or false. (By definition, the green symbol that falls on this line is considered to be in the right half, since the inequality is strict.) This is a good split in that all data points for which the test condition is false have the same label (green), all data points for which the test condition is true have the same label (red), and the labels differ between the two subsets. The split condition (x1 < 6) in Fig. 3-b is not as good, since the subset for which the condition is true contains datapoints of both labels.

Figure 3: (a), (b): Examples of different splits on the x1 attribute.

Splits can be done on either of the attributes. For the example in Fig. 4-a, a good split would be defined by the test condition (x2 < 4). The situation is more complicated, however, when the data points cannot be separated by a threshold on x1 or x2, as in Fig. 4-b. It is unclear which of the three splits is best. We need a quantitative way of deciding.

Figure 4: (a) Example of a data set and a split using the x2 attribute, where the two subsets have distinct labels. (b) An example of a data set in which there is no way to split using either an x1 or x2 value such that the two subsets have distinct labels.

2.2 Entropy

To handle more complicated situations, one needs a quantitative measure of the 'goodness' of a split, in terms of the impurity of the labels in a set of data points. There are many ways to do so. One of the most common is called entropy.
The standard definition of entropy¹ is:

    H = − Σ_i p_i log2(p_i)                                  (1)

where the p_i are values² satisfying 0 ≤ p_i ≤ 1 and Σ_i p_i = 1. Note that the minus sign is needed to make H positive, since log2(p_i) < 0 when 0 < p_i < 1. Also, note that if p_i = 0 then p_i log2(p_i) = 0, since that is the limit of this expression as p_i → 0. (Recall l'Hôpital's Rule from Calculus 1.)

For the special case that there are only two values, namely p1 and p2 = 1 − p1, the entropy H is between 0 and 1, and we can write H as a function of the value p = p1. For a plot of this function H(p), see: https://en.wikipedia.org/wiki/Binary_entropy_function

¹ Entropy is an extremely important concept in science. It has its roots in thermodynamics in the 19th century. In the 20th century, "information entropy" became one of the basic techniques in electronic communication (telephones, cell phones, internet, etc). In computer science, information entropy is heavily used in data compression, cryptography, and AI.

² Such functions p_i are often used to model probabilities, as you will learn if you take MATH 323 or MATH 203, for example.

In this assignment, we use entropy to choose the best split for a data set, based on its labels. We borrow the formula for entropy and apply it to our problem as follows:

    H(D) = − Σ_{y∈L} p(y) log2 p(y)                          (2)

where
• L is the set of labels, and y is a particular label
• D is a data set; each data point has two attributes and a label, i.e. (x1, x2, y)
• H(D) is the entropy of the dataset D
• p(y) is the fraction of data points in dataset D with label y.

Since L consists of only two labels, the entropy is between 0 and 1. Entropy is 0 if p(y) takes values 0 and 1 for the two labels. Entropy is 1 if p(y) = 0.5 for both labels. Otherwise it has a value strictly between 0 and 1. See the plot in the link above.
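The formulas above, together with the weighted average entropy used later when evaluating splits, can be sketched as follows (an illustration only, assuming 0/1 integer labels; this is not the starter code's calcEntropy()):

```java
// Hedged sketch of Eq. (2) and the weighted average entropy of Section 2.3.
// Labels are assumed to be 0/1 ints; names and signatures are our own.
class EntropySketch {
    // H(D) = -sum_y p(y) log2 p(y), with 0 * log2(0) taken as 0.
    static double entropy(int[] labels) {
        int ones = 0;
        for (int y : labels) ones += y;
        double p1 = (double) ones / labels.length;
        double p0 = (double) (labels.length - ones) / labels.length;
        return term(p0) + term(p1);
    }

    // H(D1, D2) = w1*H(D1) + w2*H(D2), each weight computed separately
    // (per the handout's warning, we do NOT use w2 = 1 - w1).
    static double avgEntropy(int[] d1, int[] d2) {
        double n = d1.length + d2.length;
        double w1 = d1.length / n;
        double w2 = d2.length / n;
        return w1 * entropy(d1) + w2 * entropy(d2);
    }

    private static double term(double p) {
        return p == 0 ? 0 : -p * (Math.log(p) / Math.log(2)); // log2 via change of base
    }
}
```

On the split-2 subsets of the Section 2.4 example (4 red + 1 green, and 2 red + 5 green), avgEntropy evaluates to about 0.80, consistent with equation (5).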
2.3 Using entropy to define a good split

During the training phase, when one constructs the decision tree, a node is given a data set D as input. If D has entropy greater than 0, then we would like to split the data set into two subsets D1 and D2. We would like the entropy of the subsets to be lower than the entropy of D. The subsets may have different entropy, however, so we consider the average entropy of the subsets. Moreover, because one subset might be larger than the other, we give more weight to the larger subset. So we define the average entropy like this:

    H(D1, D2) ≡ w1 × H(D1) + w2 × H(D2)

where w_i is the fraction of the points in subset i:

    w_i = (number of datapoints in D_i) / (number of datapoints in D),

with i either 1 or 2, and D the union of D1 and D2. Note that w1 + w2 = 1.

NOTE: DO NOT use the formula w2 = 1 − w1. Although correct, it leads to a numerical approximation error. Compute each of the weights (w1, w2) separately using the formula above.

ASIDE: (We mention the following because you will likely encounter it in your reading.) When building a decision tree, one often considers the difference H(D) − H(D1, D2), which is called the information gain. For example, one may decide whether or not to split a node based on whether the information gain is sufficiently large. In this assignment, you will instead use a different criterion to decide whether to split a node when building the decision tree. Your criterion will be based on the number of data items in D, as will be discussed later.

2.4 Example

Let us now return to the examples we saw earlier and use entropy to discuss which split is best. Recall the example of Fig. 4-b. To calculate the entropy of the dataset D before the split, note there are two classes with 6 points in each class.
So the fraction of points in each class is 0.5 and the entropy is:

    H(D) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1            (3)

• Split 1 breaks the dataset into two sub-datasets, where the sub-dataset on top contains 8 points (5 red dots, 3 green crosses) and the one below has 4 points (1 red dot, 3 green crosses). Calculating the average entropy H(D1, D2) after the split yields:

    (4/12)[−(1/4) log2(1/4) − (3/4) log2(3/4)] + (8/12)[−(5/8) log2(5/8) − (3/8) log2(3/8)] = 0.906    (4)

• Split 2 breaks the dataset into two sub-datasets, where the sub-dataset on the left has 5 datapoints (4 red dots, 1 green cross) and the other one has 7 datapoints (2 red dots, 5 green crosses). The average entropy H(D1, D2) after the split is:

    (5/12)[−(1/5) log2(1/5) − (4/5) log2(4/5)] + (7/12)[−(2/7) log2(2/7) − (5/7) log2(5/7)] = 0.803    (5)

• Split 3 breaks the dataset into two sub-datasets, where the sub-dataset on the left has 7 datapoints (5 red dots, 2 green crosses) and the other one has 5 datapoints (1 red dot, 4 green crosses). The average entropy H(D1, D2) after the split is:

    (7/12)[−(2/7) log2(2/7) − (5/7) log2(5/7)] + (5/12)[−(1/5) log2(1/5) − (4/5) log2(4/5)] = 0.803    (6)

So splits 2 and 3 have lower average entropy than split 1. It is no problem that splits 2 and 3 have the same average entropy: either one can be chosen as the 'best split'. For the assignment, select the first best split encountered, where the check for the best split starts from the first attribute (x[0]) and proceeds from there, and for a given attribute the check starts from the first datapoint and proceeds from there.

2.5 Finding the best split

We can use entropy to compare the 'goodness' of different splits. The simplest way to define the best split is just to consider all possible splits and choose the one with the lowest average entropy. In this assignment, you will take this brute-force approach. You will consider each attribute x1 and x2, and each of the values of that attribute for points in the data set.
You will compute the average entropy for splitting on that attribute value, and you will choose the split that gives the minimum.

Data: A dataset
Result: An attribute and a threshold

FIND_BEST_SPLIT(data) {
  best_avg_entropy := inf;
  best_attr := -1;
  best_threshold := -1;
  for each attribute in x do
    for each data point in the list do
      compute the split and the current avg entropy based on that split;
      if best_avg_entropy > current_avg_entropy then
        best_avg_entropy := current_avg_entropy;
        best_attr := attribute;
        best_threshold := value;
      end
    end
  end
  return (best_attr, best_threshold)
}

Note the order of the two for loops! If you use the opposite order, you might get a different tree (wrong answer). Note also that if the minimum average entropy you find is equal to the entropy of the input data set, then there is no point in performing a split. In this case, the node should simply be made a leaf with label equal to the majority of the labels in the data set.

So far we have seen how to create the decision tree and the intuitions behind the math needed to choose the splits that build the tree. Let us consider another example. Consider the data shown below. A decision tree for this dataset would have splits as shown below.

Figure 5: Example of an overfit decision tree on the outlier dataset

But does this seem right? Given that all the points around it are of class 1, and the other points of the same class are far to the left, it is possible that the datapoint labeled class 2 at the right side of the graph (6,6) is an outlier or an anomalous reading in the data.
Points like these are usually, if not always, present in a dataset, and care should be taken so that our decision tree does not try too hard to reduce the impurity of the sub-datasets by splitting the data again and again in an attempt to classify the outliers into purer groups. The phenomenon described above is called overfitting: the algorithm – in this case the construction of a decision tree – tries to find a "model" that accounts for all of the data, even data which might be garbage for some reason (noise, or an error in the program or device that produced the data).

2.6 Preventing Overfitting

There are different ways to prevent overfitting. The method that you will use is called early stopping, namely: stop splitting nodes in the decision tree when the number of data points in a subset is smaller than some predetermined number. The issue of overfitting is very important in decision trees and in machine learning in general, but the details are beyond the scope of this assignment.

3 Instructions and starter code

The starter code contains three classes:

Datum.java This class holds the information of a single datapoint. It has two variables, x and y. x is an array containing the attributes, and y contains the label. The class also comes with a method toString(), which returns a string representation of the attributes and label of a single datapoint.

DataReader.java This class deals with three things. The method read_data() reads a dataset from a CSV file³ and splits the dataset into the training and test set using splitTrainTestData(). It also has methods that deal with reading and writing of "serialized" decision tree objects.

DecisionTree.java This is the main class, which deals with the creation of a decision tree and the classification of datapoints using the created decision tree. You will be implementing some of the methods in this class.
Let us go through the different members and attributes of the class:

• The constructor builds a decision tree by calling the fillDTNode() method on a dataset. It is given a list of data points and a parameter that specifies the minimum number of datapoints that must be present in a dataset for it to qualify for a split. This minimum number is used to reduce the chances of overfitting, as discussed above.

• There is a root node field, called rootDTNode, through which the other nodes can be accessed.

• There is a field called minSizeDatalist, used to store the minimum number of datapoints that must be present in a dataset to initiate a split.

• There is a nested class DTNode. This class is used to represent a single node of a decision tree. There are two types of nodes: internal nodes, which define an attribute and a threshold that help in classification, and leaf nodes, which determine the labels of those data points whose attributes obey the threshold conditions of the ancestor internal nodes leading up to the leaf node. The DTNode contains the following members:

– leaf: a boolean variable that indicates whether this node object is a leaf or not.

– label: an integer variable that indicates the label of the node. The label of the node indicates the class of a datapoint that reaches that particular leaf node after traversing the tree. This is valid only if the node is a leaf node. The classes for this assignment are simply 0 or 1.

– attribute: the index of the attribute on which the dataset is split at that particular node. In the code, the attributes are x[0], x[1], ..., x[n-1], so the index values are 0, 1, . . . , n − 1, respectively. The value stored in this field is meaningful only for an internal node.

– threshold: this holds the value of the attribute at which the split is done.
This is also meaningful only for an internal node.

– left, right: these two variables are of type DTNode, and they represent the two children of an internal node. At the classify stage, the left child leads to a decision tree node that handles the case where the value of the attribute is less than the threshold, and right handles the case where the value is greater than or equal to the threshold. For a leaf node, they are both null.

³ According to https://en.wikipedia.org/wiki/Comma-separated_values, a comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

The class DTNode also contains a few member methods that help in building the tree:

– fillDTNode(): this is the method that does all the heavy lifting in the entire assignment. Given a list of data points and a minimum size for splitting (see earlier), this recursive function creates the entire decision tree. A detailed description is present in the comments of the code.

– findMajority(): given a list of datapoints D, this method returns a label for that set. This method is called during training (construction of the decision tree). It can happen that the size of a dataset falls below the minimum size given by minSizeDatalist but still contains datapoints from more than one class. In such a case, a leaf node is created. To determine the label for this leaf node, the method goes through all the points and finds the majority label, i.e. the most common label for those points. This label will be used later at the classification phase, when a new data point reaches that leaf node. When choosing the label for a leaf node, if there is no majority (a tie), then the label with the smallest value is returned.
– classifyAtNode(): at the testing phase, given a datapoint with only its attributes (no label), this method returns the label specified by the decision tree leaf node that the datapoint reaches (as determined at training time by the findMajority() method).
– equals(): given another node, this method checks if the tree rooted at the given node is equal to the tree rooted at the calling node. The definition of ‘equality’ is elaborated in the next section.
• calcEntropy(): given a dataset, this function calculates the entropy of the dataset.
• classify(): given a datapoint (without the label), predicts the label of the datapoint. The only difference between this method and classifyAtNode() is that classifyAtNode() does the classification starting at its member DTNode, whereas for classify() the DTNode is the root of the created decision tree.
• checkPerformance(): given a dataset where the datapoints have both attributes and labels, this method runs the classify() function on all of the datapoints (using only the attributes) and compares the label given by the decision tree with the “ground truth” label for each data point. The method returns the fraction of datapoints that were predicted wrong, in the form of a string.
• equals(): given two decision trees, this method checks if the two trees are equal or not. It returns a boolean value.

Your task
You need to implement three methods from the DTNode class. None of the methods depend on each other. We suggest that you implement equals() first.
1. DTNode.equals() (30 points)
This method compares two DTNodes. Given another DTNode object, it checks if the tree rooted at the calling DTNode is equal to the tree rooted at the DTNode object that is passed as the parameter. Two DTNodes are considered equal if: (a) a traversal (e.g. preorder) of each of the two trees encounters nodes that are equal; (b) internal node: the thresholds and attributes should be the same; (c) leaf node: the labels should be the same.
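The traversal in classifyAtNode() and the equality rule just stated can be sketched compactly as follows (illustrative Python; the assignment itself is in Java, and these names merely mirror the DTNode members described earlier):

```python
from types import SimpleNamespace as Node  # stand-in for DTNode

def classify_at_node(node, x):
    # Walk down the tree: left if x[attribute] < threshold, else right.
    while not node.leaf:
        node = node.left if x[node.attribute] < node.threshold else node.right
    return node.label

def nodes_equal(a, b):
    # Leaves must agree on label; internal nodes on attribute and threshold,
    # and (recursively) on both subtrees.
    if a.leaf != b.leaf:
        return False
    if a.leaf:
        return a.label == b.label
    return (a.attribute == b.attribute and a.threshold == b.threshold
            and nodes_equal(a.left, b.left) and nodes_equal(a.right, b.right))

t1 = Node(leaf=False, attribute=0, threshold=0.5,
          left=Node(leaf=True, label=0), right=Node(leaf=True, label=1))
t2 = Node(leaf=False, attribute=0, threshold=0.5,
          left=Node(leaf=True, label=0), right=Node(leaf=True, label=1))
print(classify_at_node(t1, [0.9]))  # 1
print(nodes_equal(t1, t2))          # True
```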
Note that the tester you are given uses this equals method to check if the tree you implement in the second part matches the actual solution.
2. DTNode.fillDTNode() (50 points)
This method takes in a datalist (i.e. an ArrayList of objects of type Datum) and returns the calling DTNode object as the root of a decision tree trained using the datapoints present in the input datalist.
3. DTNode.classifyAtNode() (20 points)
This method takes in a datapoint (excluding the label) in the form of an array of type double (like Datum.x) and should return its corresponding label (int).

4 Data
The different datasets used in the assignment are shown below. If you look at the DataReader() class, you will note that only half of the data in each plot is used in the training. The other half is used to test the performance of the decision tree. This testing phase is not part of the assignment, but we give you some performance data so that you can appreciate the differences in the data sets. For each of the three data sets, we list the performance of the decision tree in classifying the other half (test set) of the data. We show both the fraction of training points that are misclassified (error) and the fraction of the test points that are misclassified, and we do so for different values of the variable minSizeDatalist.

4.1 Highly overlapping data:
Notice that the training error is 0 when the minimum size is 1 and rises as minSizeDatalist increases. However, the test error is near 0.5 (50 percent) for all values of minSizeDatalist.
minSizeDatalist : 1     Training error : 0.000   Test error : 0.495
minSizeDatalist : 2     Training error : 0.000   Test error : 0.495
minSizeDatalist : 4     Training error : 0.030   Test error : 0.495
minSizeDatalist : 8     Training error : 0.105   Test error : 0.520
minSizeDatalist : 16    Training error : 0.200   Test error : 0.515
minSizeDatalist : 32    Training error : 0.235   Test error : 0.515
minSizeDatalist : 64    Training error : 0.310   Test error : 0.525
minSizeDatalist : 128   Training error : 0.390   Test error : 0.490

[Figure 6: Plot of data present in data high overlap.csv]

4.2 Partially overlapping data:

[Figure 7: Plot of data present in data partial overlap.csv]

In this case, 10 or 12 of the 200 test data points are misclassified for most values of minSizeDatalist. The error rises considerably when minSizeDatalist is 128.

minSizeDatalist : 1     Training error : 0.000   Test error : 0.050
minSizeDatalist : 2     Training error : 0.000   Test error : 0.050
minSizeDatalist : 4     Training error : 0.015   Test error : 0.050
minSizeDatalist : 8     Training error : 0.035   Test error : 0.060
minSizeDatalist : 16    Training error : 0.045   Test error : 0.050
minSizeDatalist : 32    Training error : 0.045   Test error : 0.050
minSizeDatalist : 64    Training error : 0.075   Test error : 0.050
minSizeDatalist : 128   Training error : 0.255   Test error : 0.245

4.3 Minimal overlapping data:

[Figure 8] The data points here have very little overlap. We are loosely calling the overlap “minimal”, although truly minimal would mean zero overlap. We leave some overlap so that overfitting can potentially occur. Note that exactly one (of 200) training data points is misclassified when minSizeDatalist is 4 or more. The test error is roughly constant. It is slightly larger for small values of minSizeDatalist.
minSizeDatalist : 1     Training error : 0.000   Test error : 0.040
minSizeDatalist : 2     Training error : 0.000   Test error : 0.040
minSizeDatalist : 4     Training error : 0.005   Test error : 0.040
minSizeDatalist : 8     Training error : 0.005   Test error : 0.035
minSizeDatalist : 16    Training error : 0.005   Test error : 0.035
minSizeDatalist : 32    Training error : 0.005   Test error : 0.035
minSizeDatalist : 64    Training error : 0.005   Test error : 0.035
minSizeDatalist : 128   Training error : 0.005   Test error : 0.035


[SOLVED] Assignment 2 comp 250

There are several learning goals for this assignment. First, you will get some exposure to simple cryptography: we’ll introduce the idea behind the one-time pad, and you will implement an example of a stream cipher. Second, you will get some experience working with linked lists. You will implement a data structure to represent a deck of cards, implemented as a circular doubly linked list. Third, in this assignment we will start to focus also on the efficiency of your algorithms. You will learn to look at code with a more critical eye, not focusing only on the correctness of your methods. Lastly, this assignment will also give you more practice programming in Java! Although COMP 250 is not a course about how to program, programming is a core part of computer science, and the more practice you get, the better you will become at it.

Introduction
In 1917, Vernam patented a cipher now called the one-time pad encryption scheme. The point of an encryption scheme is to transform a message so that only those authorized will be able to read it. The one-time pad was later (in 1949) proved to be perfectly secret. The idea behind the one-time pad is that given a plaintext message of length n, a uniformly random stream of digits of length n (which is the key) is generated and then used to encode the message. The message is concealed by replacing each character in the plaintext with a character obtained by combining the original one with one of the digits in the given key. Of course, the message can be retrieved by performing the inverse operation on the characters of the encoded message (the ciphertext). Only those with access to the key can encode and decode a message. The one-time pad is perfectly secret, but it has a number of drawbacks: for it to be secure, the key is required to be as long as the message, and it can only be used once! This clearly makes the cipher an inconvenient one to use.
Unfortunately, it was also proven that the limitations of the one-time pad are inherent to the definition of perfect secrecy. This means that to overcome those limitations the security requirements have to be relaxed. Stream ciphers use the same idea as the one-time pad encryption scheme, except that a pseudorandom sequence of digits is used as the pad instead of a truly random one. The idea is to use what are called ‘pseudorandom generators’, which given a smaller key can generate streams of pseudorandom digits. In Neal Stephenson’s novel Cryptonomicon, two of the main characters are able to covertly communicate with one another with a deck of playing cards and knowledge of the Solitaire encryption algorithm, which was created (in real life) by Bruce Schneier. The novel includes a description of the algorithm, but you can also find a revised version on the web [1]. The Solitaire encryption algorithm is an example of a stream cipher. The key in this case is the deck of cards in its initial configuration. If two parties, Alice and Bob, share the same deck, then by following the Solitaire encryption algorithm they will be able to communicate by encoding and decoding messages. Of course, the deck and its configuration (i.e. the key) have to be kept secret to achieve secrecy. To encode and decode messages, Alice and Bob use the deck to generate a pseudorandom keystream which is then used as the “pad”.

Encode/Decode with Solitaire
Given a message to encode, we need to first remove all non-letters and convert any lower-case letters to upper-case. We then use the keystream of values and convert each letter to the letter obtained by shifting the original one a certain number of positions to the right in the alphabet. This number is the one found in the keystream at the same position as the character we are encoding. Decryption is just the reverse of encryption.
Using the same keystream that was used to generate the ciphertext, convert each letter to the letter obtained by shifting the original one the given number of positions to the left in the alphabet. For example, let’s say that Alice wants to send the following message:

Is that you, Bob?

She will first remove all the non-letters and capitalize all the remaining ones, obtaining the following:

ISTHATYOUBOB

She will then generate a keystream of 12 values. We’ll talk about keystream generation in the next section, so for now let’s assume that the keystream is the following:

11 9 23 7 10 25 11 11 7 8 9 3

Finally, she can generate the ciphertext by shifting each letter the appropriate number of positions to the right in the alphabet. For example, the ‘I’ shifted 11 positions to the right becomes a ‘T’. The ‘S’ shifted 9 positions to the right becomes a ‘B’. And so on! The final ciphertext will be:

TBQOKSJZBJXE

Bob, upon receiving the message, will need to generate the keystream. If Alice and Bob shared the same key and used it to generate the same number of pseudorandom values, then the keystream generated at this point by Bob will be equal to the one used by Alice to encrypt the message. All that is left for Bob to do is convert all the letters by shifting them the appropriate number of positions to the left.

[1] See https://en.wikipedia.org/wiki/Solitaire (cipher), or https://www.schneier.com/academic/solitaire/

Generating a Keystream Using a Deck of Cards
The harder part of the Solitaire encryption algorithm is generating the keystream. The idea is to use a deck of playing cards plus two jokers (a red one and a black one). Each card is associated with a value which depends on its rank and its suit. Cards in order from Ace to King have value 1 to 13, respectively. This value can increase by a multiple of 13 depending on the suit of the card.
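The shift-based encode/decode illustrated in the worked example above can be sketched as follows (illustrative Python; the assignment itself implements this in Java, and the function names are hypothetical). Letters are mapped A=0 .. Z=25 and shifts wrap around the alphabet:

```python
def encode(msg, keystream):
    # Shift each letter right by the keystream value at the same position.
    return "".join(chr((ord(m) - ord('A') + k) % 26 + ord('A'))
                   for m, k in zip(msg, keystream))

def decode(cipher, keystream):
    # Decryption is the reverse: shift each letter left by the same amount.
    return "".join(chr((ord(c) - ord('A') - k) % 26 + ord('A'))
                   for c, k in zip(cipher, keystream))

keystream = [11, 9, 23, 7, 10, 25, 11, 11, 7, 8, 9, 3]
print(encode("ISTHATYOUBOB", keystream))  # TBQOKSJZBJXE
print(decode("TBQOKSJZBJXE", keystream))  # ISTHATYOUBOB
```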
For this section let’s assume we’ll use the Bridge ranking for suits: clubs (lowest), followed by diamonds, hearts, and spades (highest). So, for instance, the Ace of clubs has value 1, while the 5 of diamonds has value 18, and the Queen of spades has value 51. The jokers have a value that depends on the number of cards in the deck. If the deck has a total of 54 cards (the 52 playing cards plus the two jokers), then the jokers have value 53. If the deck has a total of 28 cards, then the jokers have value 27. That is, both jokers have the same value, and this value is equal to the total number of cards in the deck minus one. The keystream values depend solely on the deck’s initial configuration. We will implement the deck as a circular doubly linked list with the cards as nodes. This means that the first card (the one on the top of the deck) is linked to the last card (the one at the bottom of the deck), and the last card is linked to the first one. As an example, let’s consider a deck with 28 cards: the 13 cards each of clubs and diamonds, plus the two jokers. Let’s also consider the following initial configuration [2]:

AC 4C 7C 10C KC 3D 6D 9D QD BJ 3C 6C 9C QC 2D 5D 8D JD RJ 2C 5C 8C JC AD 4D 7D 10D KD

[2] Note that this is the same example you find on the wikipedia page https://en.wikipedia.org/wiki/Solitaire (cipher), where instead of the cards they list the values.

The cards are represented by their rank, followed by their suit. For example, 6C denotes the 6 of clubs, JD the Jack of diamonds, and RJ the red joker. Here are the steps to take to generate one value of the keystream:

1. Locate the red joker and move it one card down. (That is, swap it with the card beneath it.) If the joker is the bottom card of the deck, move it just below the top card. There is no way for it to become the first card. After this step, the deck above will look as follows:

AC 4C 7C 10C KC 3D 6D 9D QD BJ 3C 6C 9C QC 2D 5D 8D JD 2C RJ 5C 8C JC AD 4D 7D 10D KD

2.
Locate the black joker and move it two cards down. If the joker is the bottom card of the deck, move it just below the second card. If the joker is one up from the bottom card, move it just below the top card. There is no way for it to become the first card. After this step, the deck above will look as follows:

AC 4C 7C 10C KC 3D 6D 9D QD 3C 6C BJ 9C QC 2D 5D 8D JD 2C RJ 5C 8C JC AD 4D 7D 10D KD

3. Perform a “triple cut”: that is, swap the cards above the first joker with the cards below the second joker. Note that here we use “first” and “second” joker to refer to whichever joker is nearest to, and furthest from, the top of the deck. Their colors do not matter. Note that the jokers and the cards between them do not move! If there are no cards in one of the three sections (either the jokers are adjacent, or one is on the top or the bottom), just treat that section as empty and move it anyway. The deck will now look as follows:

5C 8C JC AD 4D 7D 10D KD BJ 9C QC 2D 5D 8D JD 2C RJ AC 4C 7C 10C KC 3D 6D 9D QD 3C 6C

4. Perform a “count cut”: look at the value of the bottom card. Remove that number of cards from the top of the deck and insert them just above the last card in the deck. The deck will now look as follows:

10D KD BJ 9C QC 2D 5D 8D JD 2C RJ AC 4C 7C 10C KC 3D 6D 9D QD 3C 5C 8C JC AD 4D 7D 6C

5. Finally, look at the value of the card on the top of the deck. Count down that many cards. (Count the top card as number one.) If you hit a joker, ignore it and repeat the keystream algorithm. Otherwise, use the value of the card you counted to as the next keystream value. Note that this step does not modify the state of the deck. In our example, the top card is the 10 of diamonds, which has value 23. By counting down to the 24th card we find the Jack of clubs, which has value 11. Hence, 11 would be the first keystream value generated by our deck.
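The five steps above can be sketched on a plain Python list (the assignment itself uses a circular doubly linked list in Java, so this is only an illustration of the algorithm, not the required implementation). 'RJ'/'BJ' denote the jokers, and card values follow the Bridge ranking described earlier:

```python
def move_joker(deck, joker, steps):
    # Move a joker down `steps` positions; wrapping past the bottom lands it
    # just below the top card, so it can never become the first card.
    i = deck.index(joker)
    deck.pop(i)
    j = i + steps
    while j > len(deck):
        j -= len(deck)
    deck.insert(j, joker)

def next_keystream_value(deck, values):
    move_joker(deck, 'RJ', 1)                      # step 1
    move_joker(deck, 'BJ', 2)                      # step 2
    a = min(deck.index('RJ'), deck.index('BJ'))
    b = max(deck.index('RJ'), deck.index('BJ'))
    deck[:] = deck[b+1:] + deck[a:b+1] + deck[:a]  # step 3: triple cut
    v = values[deck[-1]]
    deck[:] = deck[v:-1] + deck[:v] + [deck[-1]]   # step 4: count cut
    out = deck[values[deck[0]]]                    # step 5 (top card is #1)
    if out in ('RJ', 'BJ'):                        # hit a joker: repeat
        return next_keystream_value(deck, values)
    return values[out]

# The 28-card example deck from above, with Bridge values (jokers = n - 1 = 27).
deck = ("AC 4C 7C 10C KC 3D 6D 9D QD BJ 3C 6C 9C QC 2D 5D 8D JD RJ "
        "2C 5C 8C JC AD 4D 7D 10D KD").split()
values = {}
for card in deck:
    if card in ('RJ', 'BJ'):
        values[card] = len(deck) - 1
    else:
        rank = {'A': 1, 'J': 11, 'Q': 12, 'K': 13}.get(card[:-1]) or int(card[:-1])
        values[card] = rank + (13 if card[-1] == 'D' else 0)

first = next_keystream_value(deck, values)
print(first)  # 11, matching the worked example
```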
Instructions and Starter Code
As mentioned in the previous section, we will use a circular doubly linked list to represent a deck of cards. The starter code contains two files with five classes, which are as follows:
• Deck – This class defines a deck of cards. Most of your work goes into this file. This class contains three nested classes: Card, PlayingCard, and Joker.
• SolitaireCipher – This class represents a stream cipher that uses the Solitaire algorithm to generate the keystream and then encode/decode messages.
Please note that we defined all the members of the classes public for testing purposes. In reality, for better coding style, most of those methods and all of the fields should have been kept private.

Methods you need to implement
For this assignment you need to implement all of the methods listed below. See the starter code for the full method signatures. Your implementations must be efficient. For each method below, we indicate the worst case run time using O() notation.
• Deck.Deck(int numOfCardsPerSuit, int numOfSuits): creates a deck with cards from Ace to numOfCardsPerSuit for the first numOfSuits in the class field suitsInOrder. The cards should be ordered first by suit, and then by rank. In addition to these cards, a red joker and a black joker are added to the bottom of the deck, in this order. For example, with input 4 and 3, and suitsInOrder as specified in the file, the deck contains the following cards in this specific order:

AC 2C 3C 4C AD 2D 3D 4D AH 2H 3H 4H RJ BJ

The constructor should raise an IllegalArgumentException if the first input is not a number between 1 and 13 (both included) or the second input is not a number between 1 and the size of the class field suitsInOrder. Remember that a deck is a circular doubly linked list, so make sure to set up all the pointers correctly, as well as the instance fields.
• Deck.Deck(Deck d): creates a deck by making a deep copy of the input deck. Hint: use the method getCopy from the class Card.
Disclaimer: this is not the correct way of making a deep copy of objects that contain circular references, but it is a simple one and good enough for our purposes.
• Deck.addCard(Card c): adds the input card to the bottom of the deck. This method runs in O(1).
• Deck.shuffle(): shuffles the deck. There are different ways of doing this, but for this assignment you will need to implement the Fisher–Yates shuffle algorithm. The algorithm runs in O(n) time using O(n) space, where n is the number of cards in the deck. To perform a shuffle of the deck, follow these steps:
– Copy all the cards into an array
– Shuffle the array using the following algorithm: for i from n-1 to 1 do j
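The pseudocode above is cut off in this copy; for reference, a standard Fisher–Yates shuffle over an array looks like the following sketch (illustrative Python; the assignment implements it in Java over the copied card array):

```python
import random

def fisher_yates(items):
    """In-place Fisher-Yates shuffle: for i from n-1 down to 1, pick a random
    j in [0, i] and swap items[i] and items[j]. Runs in O(n) time."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)  # inclusive bounds: 0 <= j <= i
        items[i], items[j] = items[j], items[i]
    return items

print(fisher_yates(list(range(10))))  # a uniformly random permutation of 0..9
```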


[SOLVED] Assignment 1 comp 250 market place

For this assignment you will write several classes to simulate an online market place. Make sure to follow the instructions below very closely. Note that in addition to the required methods, you are free to add as many other private methods as you want (no other additional method is allowed).

[10 points] Write an abstract class MarketProduct which has the following private field:
• A String name
The class must also have the following public methods:
• A constructor that takes a String as input indicating the name of the product and uses it to initialize the corresponding attribute.
• A final getName() method to retrieve the name of this MarketProduct.
• An abstract method getCost() which takes no input and returns an int. This method should be abstract (thus, not implemented) because how to determine the cost depends on the type of product.
• An abstract method equals() which takes an Object as input and returns a boolean. This method should be abstract as well, since depending on the type of product different conditions should be met for two products to be considered equal.

[25 points] All of the following must be subclasses of MarketProduct:
• Write a class Egg derived from the MarketProduct class. The Egg class has the following private fields:
– An int indicating the number of eggs.
– An int indicating the price per dozen of these eggs.
Note that all the prices (throughout the assignment) are indicated in cents. For instance, 450 represents the amount $4.50. The Egg class also has the following public methods:
– A constructor that takes as input a String with the name of the product, an int indicating the number required, and an int indicating the price of the product by the dozen. The constructor uses the inputs to create a MarketProduct and initialize the corresponding fields.
– A getCost() method that takes no input and returns the cost of the product in cents. The cost is computed based on the number required and the cost per dozen.
For instance, 4 large brown eggs at 380 cents/dozen cost 126 cents (the cost should be rounded down to the nearest cent). You may assume that the cost of all MarketProducts fits within an int and therefore doesn’t cause overflow.
– An equals() method which takes as input an Object and returns true if the input matches this in type, name, cost and number of eggs. Otherwise the method returns false.
• Write a class Fruit derived from the MarketProduct class. The Fruit class has the following private fields:
– A double indicating the weight in kg.
– An int indicating the price per kg in cents.
The Fruit class also has the following public methods:
– A constructor that takes as input a String with the name of the product, a double indicating the weight in kg, and an int indicating the price per kg of the product. The constructor uses the inputs to create a MarketProduct and initialize the corresponding fields.
– A getCost() method that takes no input and returns the cost of the product in cents. The cost is computed based on the weight and the price per kilogram. For instance, 1.25 kgs of asian pears at 530 cents per kg cost 662 cents.
– An equals() method just like the Egg class’s, which matches type, name, weight and cost.
• Write a class Jam derived from the MarketProduct class. The Jam class has the following private fields:
– An int indicating the number of jars.
– An int indicating the price per jar in cents.
The Jam class also has the following public methods:
– A constructor that takes as input a String with the name of the product, an int indicating the number of jars, and an int indicating the price per jar of the product. The constructor uses the inputs to create a MarketProduct and initialize the corresponding fields.
– A getCost() method that takes no input and returns the cost of the product in cents. The cost is computed based on the number of jars and their price. For instance, 2 jars of Strawberry jam at 475 cents per jar cost 950 cents.
– An equals() method like in the previous classes.
• Write a class SeasonalFruit derived from the Fruit class. The SeasonalFruit class has no fields, but it has the following public methods:
– A constructor that takes as input a String with the name of the product, a double indicating the weight in kg, and an int indicating the price per kg of the product. The constructor uses the inputs to create a Fruit.
– A getCost() method that takes no input and returns the cost of the product in cents. Since this type of Fruit is in season, its original cost should receive a 15% discount. For instance, 0.5 kgs of McIntosh apples at 480 cents per kg cost 204 cents.

[40 points] Write a class Basket representing a list of market products. Note that the instructions on how to implement this class are not always very specific. This is intentional, since your assignment will not be tested on the missing details of the implementation. Note, though, that your choices will make a difference in terms of how efficient your code will be. We will not be deducting points for inefficient code in Assignment 1. Note once again that you are NOT allowed to import any other class (including ArrayList or LinkedList). The class has (at least) the following private field:
• An array of MarketProducts.
The class must also have the following public methods:
• A constructor that takes no inputs and initializes the field with an empty array.
• A getProducts() method which takes no inputs and returns a shallow copy of the array (NOT a copy of the reference!) of MarketProducts of this basket (with the elements in the same order).
• An add() method which takes as input a MarketProduct and does not return any value. The method adds the MarketProduct at the end of the list of products of this basket.
• A remove() method which takes as input a MarketProduct and returns a boolean. The method removes the first occurrence of the specified element from the array of products of this basket.
If no such product exists, then the method returns false; otherwise, after removing it, the method returns true. Note that this method removes a product from the list if and only if the product is equal to the input received. For example, it is not possible to remove 0.25 kg of McIntosh apples via a 0.5 kg McIntosh apples MarketProduct. After the product has been removed from the array, the subsequent elements should be shifted down by one position, leaving no unutilized slots in the middle of the array.
• A clear() method which takes no inputs, returns no values, and empties the array of products of this basket.
• A getNumOfProducts() method that takes no inputs and returns the number of products present in this basket.
• A getSubTotal() method that takes no inputs and returns the cost (in cents) of all the products in this basket.
• A getTotalTax() method that takes no inputs and returns the tax amount (in cents) to be paid based on the products in this basket. Since we are in Quebec, you can use a tax rate of 15%. The tax amount should be rounded down to the nearest cent. Note that Egg and Fruit are tax free, so taxes should be paid only for Jam.
• A getTotalCost() method that takes no inputs and returns the total cost (in cents and after tax) of the products in this basket.
• A toString() method that returns a String representing a receipt for this basket. The receipt should contain one product per line. On each line the name of the product should appear, followed by its cost, separated by a tab character. After all the products have been listed, the following information should appear, one item per line:
– An empty line
– The subtotal cost
– The total tax
– An empty line
– The total cost
Note that all the integer numbers of cents should be transformed into a String formatted in dollars and cents (you can write a helper method to do so if you’d like).
If the number of cents is represented by an int that is less than or equal to 0, then it should be transformed into a String containing only the hyphen character (“-”). An example of a receipt is as follows:

Quail eggs	4.00
McIntosh apples	6.12
Asian Pears	4.24
Blueberry Jam	4.75
Blueberry Jam	4.75

Subtotal	23.86
Total Tax	1.42

Total Cost	25.28

[25 points] Write a class Customer which has the following private fields:
• A String name
• An int representing the balance (in cents) of the customer
• A Basket containing the products the customer would like to buy.
The class must also have the following public methods:
• A constructor that takes as input a String indicating the name of the customer and an int representing their initial balance. The constructor uses its inputs and creates an empty Basket to initialize the corresponding fields.
• A getName() and a getBalance() method which return the name and balance (in cents) of the customer, respectively.
• A getBasket() method which returns the reference to the Basket of the customer (no copy of the Basket is needed).
• An addFunds() method which takes an int as input representing the amount of cents to be added to the balance of the customer. If the input received is negative, the method should throw an IllegalArgumentException with an appropriate message. Otherwise, the method will simply update the balance and return the new balance in cents.
• An addToBasket() method which takes a MarketProduct as input and adds it to the basket of the customer. This method should not return anything.
• A removeFromBasket() method which takes a MarketProduct as input and removes it from the basket of the customer. The method returns a boolean indicating whether or not the operation was successful.
• A checkOut() method which takes no input and returns the receipt for the customer as a String. If the customer’s balance is not enough to cover the total cost of their basket, then the method throws an IllegalStateException.
Otherwise, the customer is charged the total cost of the basket, the basket is cleared, and a receipt is returned.
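As a sanity check on the rounding rules, here is an illustrative Python sketch of the cost, tax, and dollars-and-cents formatting computations, using the numbers from the examples above (the assignment itself is in Java, and these function names are hypothetical):

```python
# All amounts are integer cents, always rounded down (floor).

def egg_cost(num, price_per_dozen):
    return num * price_per_dozen // 12             # floor division

def jam_cost(jars, price_per_jar):
    return jars * price_per_jar

def total_tax(jam_cents, rate_percent=15):
    # Only Jam is taxed; result rounded down to the nearest cent.
    return jam_cents * rate_percent // 100

def format_cents(cents):
    # '23.86'-style string; a lone hyphen when the amount is <= 0.
    return "-" if cents <= 0 else f"{cents // 100}.{cents % 100:02d}"

print(egg_cost(4, 380))                         # 126, as in the Egg example
jam = jam_cost(2, 475)                          # 950, two jars of jam
subtotal = 400 + 612 + 424 + jam                # items from the sample receipt
print(format_cents(subtotal))                   # 23.86
print(format_cents(total_tax(jam)))             # 1.42
print(format_cents(subtotal + total_tax(jam)))  # 25.28
```

Note how the floor happens once, on the tax of the taxable subtotal (950 × 0.15 = 142.5 → 142 cents), which reproduces the 1.42 line on the sample receipt.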


[SOLVED] Csci 6515 homework 2: vector space models and embeddings

In this assignment, you will be examining different ways of representing text as vectors, including sparse word vectors that you will create and off-the-shelf word2vec embeddings. As always, you are allowed to discuss the homework with other students as well as use online resources (provided you list the names of everyone you spoke with and the list of online resources you used at the top of your assignment). However, all submitted code and writing must be your own and you must understand everything you submit. In particular, you are not allowed to use Generative AI tools in completing this assignment.

Potentially useful:
● scipy.spatial.distance.pdist
● Many packages (e.g., scipy, sklearn) have functions for calculating cosine, or you can implement it yourself. If you use an off-the-shelf package, be sure to check that it calculates cosine as we defined it in class.

Part 1: Train sparse word vectors
In the first part of the assignment, you will be constructing vector representations of words that model a word’s context. For a given word, w, define a context window of radius k to capture the k words to the left and right of w. We wish to represent all contexts around w as a “bag of words”. Given a corpus of sentences, we construct a fixed-size vocabulary V (we will limit this to the top 1000 most frequent words). Given this corpus, create a term-term matrix, in which the rows are words in your vocabulary and the columns are context words. Your task is to transform a corpus of text into word vectors according to this context-window principle, using a context window of k=2.

(1) You will use a subset of the Brown corpus to create your word vectors. The full corpus includes >1 million word tokens; however, to ease the processing for this assignment, you will use a subset of ~100,000 word tokens (brown100k.txt). Load the data, split by white space and make sure all words are lowercase, but there is no need to do any further processing.

(2) Construct your vectors.
Create a term-term matrix in which the rows are words in the vocabulary, the columns are context words, and the cells represent the number of times the context word occurred within k=2 words of the target word. Limit your vocabulary to the 1,000 most frequent words, but do not limit which context words are included. That is, you will only calculate vectors for 1000 words, but you will use all of the words in the corpus to create the context.

(a) Respond to the questions below about this set of word vectors.
(i) What are the dimensions of your matrix? What determines these dimensions (i.e., why are these the dimensions)? What percentage of the matrix’s elements are 0?
(ii) Pick your favorite word in the vocabulary. Show the 20 closest words to your chosen word, calculated using the cosine similarity metric. Are these what you expected? Why or why not?
(iii) Which two words in your vocabulary are most distinct? Which are most similar to each other? Does this make sense to you, or are these surprising?

(3) Next, read in pre-trained word2vec vectors (see Part 3 for instructions).
(a) Respond to the following questions:
(i) Describe the pre-trained vectors. What are their dimensions? What data were they pre-trained on? How many word tokens were they pre-trained on?
(ii) Using the same word as above, show the 20 closest words to it, using cosine similarity. Has anything changed? Are these what you expected? Why or why not?
(iii) Which two distinct words are most similar? Least similar? Has this changed? Is this what you expected?
(iv) Which of the two embeddings seem best to you, based on what you have observed?

Note: You should respond to these questions based on the Brown subcorpus, but I highly recommend constructing a toy dataset to test your code on before running it on the full data. If you have trouble running your code on the full dataset, please get in touch with me!
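The context-window counting and cosine lookup described above can be sketched on a toy corpus as follows (illustrative Python; variable names are my own, and a real solution would use the Brown subcorpus and restrict rows to the top-1000 vocabulary):

```python
import math
from collections import defaultdict

def term_term_counts(tokens, k=2):
    # counts[w][c] = number of times context word c occurs within k words of w.
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(u[x] * v.get(x, 0) for x in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
m = term_term_counts(tokens, k=2)
print(round(cosine(m["cat"], m["cat"]), 3))  # 1.0 -- identical vectors
print(round(cosine(m["cat"], m["dog"]), 3))  # high: cat and dog share contexts
```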
Part 2: Skip Grams

In this part of the homework, you will manually perform one step of training a word2vec embedding for a small toy example. The goal is to consolidate how skip-gram with negative sampling training (as described in the textbook and in class) works. We define the probability that a word, w, co-occurred with the context word, c, as follows:

P(+ | w, c) = σ(c · w) = 1 / (1 + e^(−c · w))

We will consider a toy example involving two-dimensional vectors and only one sentence: "Cat litter smells bad". We will only consider the target word "cat" with a context window of k=1 word. The positive skip-gram example (i.e., the tuple of words (target_word, context_word) that shows which context word occurred in the context of our target word) is (cat, litter). For each positive skip gram, we will sample two negative skip grams, by randomly selecting two words in our vocabulary that are not 'cat'. Let's say we randomly sample (cat, remote) and (cat, oatmeal). This gives us the following training example, including tuples and their labels:

(cat, litter) -> +
(cat, remote) -> –
(cat, oatmeal) -> –

We start by randomly initializing our embeddings. This gives us the following initial word embeddings:

w_cat: [1, 1]
c_remote: [0, 1]
c_oatmeal: [0, -1]
c_litter: [1, 0]

(Ok, it wasn't totally random ;)).

To hand in:
(1) Plot these four vectors. You can use matplotlib or simply plot by hand (it doesn't need to be super exact; just label the points with their values if you're approximating the plot).
(2) For each of the three tuples in our training example, does the model output that they are context pairs or not? Is the model correct?
(3) You begin to train your model to arrive at target and context word vectors that will maximize the probability that positive examples are labeled as occurring together and maximize the probability that negative examples are labeled as *not* occurring together. You decide to use the loss function defined in (6.34 in SLP). What is the current loss value?
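As a check on questions (2) and (3), the toy numbers can be worked through directly. The sketch below computes P(+ | w, c) = σ(c · w) for each tuple and the negative-sampling loss from SLP eq. 6.34, using the initial vectors given above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w_cat = [1, 1]
c_pos = [1, 0]                      # litter
c_negs = [[0, 1], [0, -1]]          # remote, oatmeal

# (2) P(+ | w, c) = sigma(c . w); the model outputs "context pair" when this > 0.5
probs = [sigmoid(dot(w_cat, c)) for c in [c_pos] + c_negs]

# (3) SGNS loss (SLP 6.34): -[log sigma(c_pos . w) + sum over negatives of log sigma(-c_neg . w)]
loss = -(math.log(sigmoid(dot(w_cat, c_pos)))
         + sum(math.log(sigmoid(-dot(w_cat, c))) for c in c_negs))
```

Worked out: σ(1) ≈ 0.731 for (cat, litter) and (cat, remote), and σ(−1) ≈ 0.269 for (cat, oatmeal), giving an initial loss of about 1.94.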
(4) You decide to perform gradient descent to update your vectors. Step through one step of stochastic gradient descent (consider all three example tuples as one step). Set the learning rate η = 1. What are the updated vectors? Plot them.
(5) For each of the three tuples and the updated vectors, does the model now make the correct prediction?
(6) What is the new loss function value?
(7) Describe what has changed and how this will, over time, work to create word embeddings that represent word similarity. Consider how the predictions, vectors, and loss have changed.

Part 3: Evaluating Word2Vec

In this final part of the homework, you will be evaluating word2vec vectors using two of the approaches discussed in class. You may either train up your own word2vec vectors (if you are interested, I highly recommend you do so!), or use downloaded word2vec word embeddings, which are available for download here (under pretrained word and phrase vectors) and can also be called using gensim (see pretrained models on the linked page). Report which vectors you are using: a good option is GoogleNews-vectors-negative300.bin. Note these vectors were trained in a very similar way to Part 2.

Part 3a. Analogies

One goal for successful embeddings is to be able to learn analogies of the form: Athens : Greece :: Baghdad : ________. Here, you will test whether the embeddings you are working with exhibit this property. To do so, write code such that, given three words, it outputs its answer to the analogy. That is, it should calculate vector_embedding(Greece) – vector_embedding(Athens) + vector_embedding(Baghdad) and return the word that is closest to the resulting vector. You will find a list of questions in the analogies.txt file attached to this homework assignment. Note that there are 14 different categories, which are described on lines that start with colons. For example, the first line is ": capitals-common-countries" and the next ~500 lines give examples of this analogy type.
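The analogy solver described above can be sketched as follows. This is a minimal sketch over a hypothetical toy embedding dictionary (the toy vectors are made up for illustration); with real vectors you would load the GoogleNews embeddings instead:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def solve_analogy(emb, a, b, c):
    # a : b :: c : ?  ->  the word closest (by cosine) to emb[b] - emb[a] + emb[c]
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))  # exclude the question words
    return max(candidates, key=lambda w: cosine(target, emb[w]))

# hypothetical 2-d toy embeddings, chosen so the analogy works out
emb = {"athens": [1.0, 0.0], "greece": [1.0, 1.0],
       "baghdad": [3.0, 0.0], "iraq": [3.0, 1.0], "oatmeal": [0.0, -1.0]}
```

With gensim, `model.most_similar(positive=["greece", "baghdad"], negative=["athens"])` performs roughly the same computation; note that excluding the three question words from the candidates matters, since the nearest neighbor of the offset vector is often one of the inputs.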
You will test how well the vector embeddings you are working with do at predicting the correct answer. Report your model's accuracy by question type. How does the model do? Do you notice patterns in what types of analogies it handles well and what types it handles poorly?

Part 3b. TOEFL dataset

Finally, you will test your embeddings on the TOEFL synonyms dataset ("Test of English as a Foreign Language"), a dataset introduced by Landauer & Dumais (1997) that is commonly used to evaluate word embeddings. This evaluation set contains 80 multiple-choice questions for testing synonym knowledge. For example: Choose the synonym of "enormously". Choices: (a) tremendously (b) appropriately (c) uniquely (d) decidedly. You will write code that outputs the predicted answer for a given question. It will do so by calculating the cosine similarity between the target word and each of the four choices and outputting the word that has the highest cosine similarity. The evaluation set can be found in toefl.txt. Each line is one question and is structured such that the first word is the target/queried word (i.e., enormously), the second word is the correct answer, and the remaining words are the remaining choices.

Report your overall accuracy. How does this compare to chance performance (i.e., the performance you'd expect if the model just chose an answer at random)? Are you impressed with the performance? Do you notice any patterns in what types of questions your model answers correctly or not? (It's okay if you do not, but say so.)

How much time did you spend on the homework assignment?
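The TOEFL prediction rule above can be sketched as follows. This is a minimal sketch using hypothetical toy vectors for the example question; with real embeddings the same function would run over each line of toefl.txt (note chance performance with four choices is 25%):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer_toefl(emb, line):
    # line format: "target correct_answer distractor1 distractor2 distractor3"
    target, *choices = line.split()
    # predict the choice whose vector has the highest cosine with the target
    return max(choices, key=lambda c: cosine(emb[target], emb[c]))

# hypothetical toy embeddings for the worked example
emb = {"enormously": [1.0, 0.1], "tremendously": [1.0, 0.2],
       "appropriately": [0.0, 1.0], "uniquely": [-1.0, 0.0], "decidedly": [0.0, -1.0]}
```

Accuracy is then just the fraction of lines for which the prediction equals the second word of the line.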


[SOLVED] CSCI 6515 Homework 1: Text Classification

In this assignment, you will be implementing the text classification methods we have studied in class using common NLP/ML packages. This assignment uses social media data from Twitter, consisting of tweets (the 'text' column of Tweets_5K.csv) that are rated for three categories of sentiment: positive, negative, and neutral. In brief, you will be comparing the performance of Naive Bayes and Logistic Regression classifiers run on text that has or has not been preprocessed.

Notes:
● You may discuss this assignment with classmates. However, all submitted code and answers must be your own work (and you must understand what you submit). Please list the names of anybody you talked to and list any online resources (e.g., Wikipedia) you used while working on this assignment.
● You may not use ChatGPT or similar to complete this assignment.
● Please hand in the code and your answers to the conceptual questions. The written answers can either be in the same .py file or in a separate pdf file, but they must be well-annotated and easy to find.
● Please keep track of roughly how much time you spent on the assignment to help me calibrate for future homeworks.
● Note that you will be working with packages in this assignment, but please make sure that you understand what the packages are doing. One important skill when working in this area is being able to Google around and look through documentation to find helpful resources/functions/packages, so that is part of this assignment. That being said, I'm also a resource, so please come to office hours or email me if you have questions.
● Some things that you may or may not find helpful as you work on this assignment:
○ DictVectorizer
○ accuracy_score
○ train_test_split
○ Naive Bayes
○ spaCy
○ NLTK

Part 1: Preparing Data

1.
Load the data: Create (i) a list of all tweets in the dataset, raw_tweets, and (ii) a list of the sentiments, labels, corresponding to each raw tweet, encoded as integers (-1 meaning negative, 0 meaning neutral, 1 meaning positive).

2. Basic preprocessing: You will start by implementing very basic preprocessing of the tweets, by only splitting the tweets on whitespace ("tokenization"). Create a list basic_preproc_tweets, which is a list of preprocessed tweets (which are now lists of words).

3. Featurize (bag of words): From your preprocessed tweets, create a bag-of-words matrix, basic_preproc_bow. The rows should be documents and the columns should be words (i.e., features). The cell values should be the number of times a given word occurs in a given document (with smoothing: use Laplace Add-1 smoothing). Note: You will be doing this again later in the homework.

4. Create a training set and a test set: You will need to define a training set (used to learn model parameters) and a test set (used for testing your model on unseen documents). Please use an 80%/20% split, meaning that 80% of your available data will be in the training set and the remaining 20% will be in the test set. The data are pre-shuffled, so please make the first 80% of the data the training set, and the last 20% of the data the test set.

Questions to answer and hand in (you may need to write additional code and/or print statements to answer these questions):
1. What are the dimensions of your feature matrix (X)?
2. What is the value of X[1460][1460]?
3. What does this value mean? What feature does the 1460th column represent?

Part 2: Implementing Naive Bayes

1. Implement and run Naive Bayes: You may do so by hand, if you choose, or using scikit-learn, a commonly used package for machine learning.
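If you take the by-hand route, a multinomial Naive Bayes with Laplace add-1 smoothing can be sketched as follows. This is a minimal sketch over hypothetical toy documents, not the assignment's dataset; it also shows the common convention of skipping out-of-vocabulary words at prediction time:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    # docs: list of token lists; multinomial Naive Bayes with Laplace add-1 smoothing
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    vocab = {w for d in docs for w in d}

    def predict(doc):
        def score(c):
            total = sum(counts[c].values())
            return prior[c] + sum(math.log((counts[c][w] + 1) / (total + len(vocab)))
                                  for w in doc if w in vocab)  # skip unseen words
        return max(classes, key=score)
    return predict

# hypothetical toy training data: 1 = positive, -1 = negative
predict = train_nb([["happy", "day"], ["sad", "day"], ["sad", "bad"]], [1, -1, -1])
```

Working through a query like ["happy"] with these counts by hand is exactly the kind of calculation question 1 below asks for.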
You can read about implementing Naive Bayes here to find out specifically what calls you should use to create your classifier, train your classifier, and use your trained classifier to make predictions on your test set of unseen data. Report the model's accuracy on the unseen test set. How does this compare to a classifier that always outputs the most frequent category in the training set?

Questions to answer and hand in:
1. Calculate whether the Naive Bayes classifier you trained would classify the following tweet as positive, neutral, or negative: "Happy birthday Annemarie". You should do this by hand and show your work, but you can use code to get the relevant probabilities (basically, the goal of this question is to make sure that you understand how Naive Bayes works).
2. Report the model's accuracy on the unseen test set. How does this compare to a classifier that always outputs the most frequent category in the training set (what is that classifier's accuracy)?

Part 3: Implementing Logistic Regression

1. Implement and run Logistic Regression: Now, use scikit-learn to implement Logistic Regression for sentiment analysis on the Twitter dataset. Please set max_iter = 150. Notice that the calls are similar to those you used for Naive Bayes, one of the benefits of using such packages. Report the model's accuracy on the unseen test set. How does this compare to a classifier that always outputs the most frequent category in the training set, as well as the Naive Bayes classifier?

Questions to answer and hand in:
1. In class, we learned how to do binary logistic regression between 2 options. Here, there are three possible classifications, so sklearn uses a "one vs. rest" scheme where it learns a binary logistic regression model for each possible label (i.e., one logistic regression which learns to separate positive from non-positive tweets, a second that learns to separate negative from non-negative tweets, and a final logistic regression that learns to separate neutral from non-neutral tweets). How many parameters did your multiclass logistic regression model learn?
2. Report the model's accuracy on the unseen test set. How does this compare to a classifier that always outputs the most frequent category in the training set and the Naive Bayes classifier from Part 2?

Part 4: Implementing more elaborate pre-processing

In this section, you will test the effect of implementing more extensive preprocessing, but otherwise keep the above workflow the same. Your new preprocessing should do the following in addition to tokenization:
● Lowercasing
● Lemmatization
● Removing stop words
● Removing punctuation and extra white space
● Using only the top 1,000 most frequent words, and replacing the rest with OOV (i.e., your final feature matrix should have 1001 columns: the 1,000 most frequent words and one OOV token)
● Replacing numbers with NUM

Note: Think carefully about how each step affects the next one in the pipeline and implement these in the order that makes sense and is conceptually most likely to improve results. As an example, since your final feature matrix should have 1001 columns, you will need to be careful about when you remove stop words (as these are often the most frequent words). You can implement this however you would like, but I will point you to the spaCy or nltk packages, which are popular in NLP.

Questions to answer:
1. How does this impact the performance of your Naive Bayes and Logistic Regression classification results?

Part 5: Your turn!

Questions to answer:
1.
Propose and implement at least one addition to the workflow that you think will improve performance (but please still use Naive Bayes or Logistic Regression). Specifically, you can add a pre-processing step, add a feature (or class of features), etc. Explain why you chose this feature and why you think it might help performance. Report the classifier's accuracy on the test set. Did this help as expected? Note: You will not be graded on whether your change actually improved performance.

2. Finally, take the best-performing model and look at the tweets that were incorrectly categorized. Do you observe any patterns in what the model is making mistakes on? What do you think could be done to further improve results? Provide a sample of tweets as well as their true label and their predicted label to justify your answers. Note: You do not need to know how to implement the proposed change.

3. Roughly how much time did you spend on this homework?
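As a sanity check for the parameter-count question in Part 3: under a one-vs-rest scheme, each class gets its own binary logistic regression with one weight per feature plus one bias. A minimal sketch of the count:

```python
def ovr_param_count(n_features, n_classes):
    # one-vs-rest: one binary logistic regression per class,
    # each with one weight per feature plus one bias (intercept)
    return n_classes * (n_features + 1)
```

In scikit-learn these show up as `clf.coef_` with shape (n_classes, n_features) and `clf.intercept_` with shape (n_classes,), so the count above is just the total number of entries in those two arrays.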


[SOLVED] CS-UG 3224 Introduction to Operating Systems Assignment 7 (10 points)

Write a program that uses a multi-threaded (for speedup) Monte-Carlo simulation to estimate the probability of two students in our class having the same birthday. Your program's main routine shall create a number of worker threads, NUM_THREADS=4 (defined as a macro), in order to speed up the computation. You may name the common routine "void* WorkerThread(void* pvar)" and use a shared variable (an integer) as well as synchronization primitives (e.g. a semaphore or mutex).

Each thread shall perform a number of trials, NUM_TRIALS=100,000 (defined as a macro), in which it creates a list of n random numbers, where n is passed to your program as a parameter (in our case, test with n=23 and n=49); each has a value between 0 and 364, representing each person's birthday within the year. If (at least) two of them coincide, the thread increments a shared variable nhits by 1 (once per trial). The shared variable holds the total number of times the experiments/trials succeeded (i.e. 2 or more students had a matching birthday). Thus the total number of trials is NUM_THREADS*NUM_TRIALS. When all threads complete, we shall calculate and output the probability as:

nhits/(NUM_THREADS*NUM_TRIALS)

For a class of 49 students, it should be about 97%, whereas for 23 students, it should be about 50% (https://en.wikipedia.org/wiki/Birthday_problem). After verifying the correct operation, use your program to compute the probability for classes of size 41 and 128 students.

Some useful notes:
• Use the man pages for more info on how to use a semaphore or a mutex.
• Each thread must seed the random number generator and use its own state so it won't match the other threads' state; thus each thread shall have a different sequence of random numbers, for example:
unsigned int rand_state = (unsigned int) time(NULL) + pthread_self();
• You need to use rand_r() to generate the random numbers and not rand(), for example: rand_r(&rand_state)
• Note that you will need to use the -pthread option with gcc in order to link the pthread library.

What to submit: Please submit the following files individually:
1) Source file(s) with appropriate comments. The naming should be similar to "lab#_$.c" (# is replaced with the assignment number and $ with the question number within the assignment, e.g. lab4_b.c for lab 4, question b, OR lab5_1a for lab 5, question 1a).
2) A single pdf file (for images + report/answers to short-answer questions), named "lab#.pdf" (# is replaced by the assignment number), containing:
• Screen shot(s) of your terminal window showing the current directory, the command used to compile your program, the command used to run your program and the output of your program.
3) Your Makefile, if any. This is applicable only to kernel modules.

RULES:
• You shall use kernel version 4.x.x or above. You shall not use kernel version 3.x.x.
• You may consult with other students about GENERAL concepts or methods, but copying code (or code fragments) or algorithms is NOT ALLOWED and is considered cheating (whether copied from other students, the internet or any other source).
• If you are having trouble, please ask your teaching assistant for help.
• You must submit your assignment prior to the deadline.
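Before writing the pthreads version, the expected probabilities can be sanity-checked with a quick single-threaded Monte Carlo sketch. This is Python purely for checking the math (the assignment itself must be written in C); the trial count and seed are arbitrary choices for the sketch:

```python
import random

def birthday_prob(n, trials=20000, seed=1):
    # fraction of trials in which at least two of n random birthdays coincide
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(trials)
        if len(set(rng.randrange(365) for _ in range(n))) < n)
    return hits / trials
```

With enough trials this lands near the reference values quoted above: roughly 0.5 for n=23 and roughly 0.97 for n=49, which is a useful check before debugging the threaded C version.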


[SOLVED] CS-UG 3224 Introduction to Operating Systems Assignment 6 (10 points)

1) (8 points) Repeat assignment 5B, except that you shall now use a TCP/IP socket for communicating between the processes instead of a pipe. Use the following socket functions in their default mode. You may use the man command in your Linux virtual machine for information about the parameters:

Both client and server:
• socket() – opens a socket (similar to pipe())
• read() – reads a buffer from the socket, just as in file or pipe reading
• write() – writes a buffer to the socket, just as in file or pipe writing
• close() – closes the socket

Client only:
• connect() – connects to a server

Server only:
• bind() – assigns a particular port number to the server
• listen() – listens to connection requests from clients
• accept() – accepts a connection from a client

You shall use sockets of type SOCK_STREAM and assign the parent (consumer) as the client and the child (producer) as the server. Insert an initial random wait (1 to 3 seconds) at the child process (but not the parent) prior to it starting to listen and accept connections. The parent process (client) may thus fail to connect if it tries to do so before the child process (server) has started to listen (which is after the random wait). As such, you should insert a loop in the parent that repeatedly attempts to connect, waiting 100 ms between attempts, until it eventually succeeds.

2) (2 points) Answer the following about part 1:
a. Which of the calls above are blocking and which are not? Explain what that means.
b. Is this a form of direct communication or indirect communication?
c. What is the failure flag returned from connect() that indicates the server is not ready?
d. How would you change your program to communicate between processes on different machines?

What to submit: Please submit the following files individually:
1) Source file(s) with appropriate comments. The naming should be similar to "lab#_$.c" (# is replaced with the assignment number and $ with the question number within the assignment, e.g. lab4_b.c for lab 4, question b, OR lab5_1a for lab 5, question 1a).
2) A single pdf file (for images + report/answers to short-answer questions), named "lab#.pdf" (# is replaced by the assignment number), containing:
• Screen shot(s) of your terminal window showing the current directory, the command used to compile your program, the command used to run your program and the output of your program.
3) Your Makefile, if any. This is applicable only to kernel modules.

RULES:
• You shall use kernel version 4.x.x or above. You shall not use kernel version 3.x.x.
• You may consult with other students about GENERAL concepts or methods, but copying code (or code fragments) or algorithms is NOT ALLOWED and is considered cheating (whether copied from other students, the internet or any other source).
• If you are having trouble, please ask your teaching assistant for help.
• You must submit your assignment prior to the deadline.
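The connect-and-retry pattern the parent must implement can be prototyped at a high level. The sketch below is Python (the assignment itself must be in C); the port number is a hypothetical choice for the sketch, a short sleep stands in for the 1-3 s random wait, and a thread stands in for the forked child:

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 50007        # hypothetical port for this sketch

def server():                          # child/producer in the real assignment
    time.sleep(0.2)                    # stand-in for the 1-3 s random wait
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((HOST, PORT))
        s.listen(1)
        conn, _ = s.accept()
        with conn:
            conn.sendall(b"produced item")

t = threading.Thread(target=server, daemon=True)
t.start()

data = None
for _ in range(100):                   # parent/client: retry until the server listens
    try:
        with socket.create_connection((HOST, PORT)) as c:
            data = c.recv(64)
        break
    except ConnectionRefusedError:     # server not listening yet
        time.sleep(0.1)                # wait 100 ms between attempts
t.join(timeout=5)
```

The refused-connection error the client catches here corresponds to the failure indication question 2c asks about in the C API.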


[SOLVED] CS-UG 3224 Introduction to Operating Systems Assignment 4 (10 points)

A) (2 points) If you create a main() routine that calls fork() three times, i.e. if it includes the following code:

pid_t x=-11, y=-22, z=-33;
x = fork();
if(x==0)
    y = fork();
if(y>0)
    z = fork();

Assuming all fork() calls succeed, draw a process tree similar to that of Fig. 3.8 (page 116) in your textbook, clearly indicating the values of x, y and z for each process in the tree (i.e. whether 0, -11, -22, -33, or larger than 0). Note that the process tree should only have one node for each process, and thus the number of nodes should be equal to the number of processes. The process tree should be a snapshot just after all forks have completed but before any process exits. Each line/arrow in the process tree diagram shall represent the creation of a process, or alternatively a parent/child relationship.

B) (4 points) Write a program that creates the process tree shown below:

C) (4 points) Write a program whose main routine obtains a parameter n from the user (i.e., passed to your program when it was invoked from the shell, n>2) and creates a child process. The child process shall then create and print a Fibonacci sequence of length n whose elements are of type unsigned long long. You may find more information about Fibonacci numbers at https://en.wikipedia.org/wiki/Fibonacci_number. The parent waits for the child to exit and then prints two additional Fibonacci elements, i.e. the total number of Fibonacci elements printed by the child and the parent is n+2. Do not use IPC in your solution to this problem (i.e. neither shared memory nor message passing).

What to hand in (using Brightspace): Please submit the following files individually:
1) Source file(s) with appropriate comments. The naming should be similar to "lab#_$.c" (# is replaced with the assignment number and $ with the question number within the assignment, e.g. lab4_b.c for lab 4, question b, OR lab5_1a for lab 5, question 1a).
2) A single pdf file (for images + report/answers to short-answer questions), named "lab#.pdf" (# is replaced by the assignment number), containing:
• Screen shot(s) of your terminal window showing the current directory, the command used to compile your program, the command used to run your program and the output of your program.
3) Your Makefile, if any. This is applicable only to kernel modules.

RULES:
• You shall use kernel version 4.x.x or above. You shall not use kernel version 3.x.x.
• You may consult with other students about GENERAL concepts or methods, but copying code (or code fragments) or algorithms is NOT ALLOWED and is considered cheating (whether copied from other students, the internet or any other source).
• If you are having trouble, please ask your teaching assistant for help.
• You must submit your assignment prior to the deadline.
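The parent/child division of labor in part C can be sketched at a high level. This is Python (the assignment itself must be in C), it assumes a POSIX system for os.fork(), and it assumes the sequence starts 0, 1; the child prints the first n elements, and the parent waits and then prints the next two:

```python
import os

def fib(n):
    # first n Fibonacci numbers, assuming the sequence starts 0, 1
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

def run(n):
    pid = os.fork()
    if pid == 0:                 # child: prints the first n elements
        print(*fib(n))
        os._exit(0)
    os.waitpid(pid, 0)           # parent waits for the child to exit...
    print(*fib(n + 2)[n:])       # ...then prints two more elements
```

Note that no IPC is needed: the parent can recompute the sequence itself, exactly as the assignment requires in C.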


[SOLVED] CS-UG 3224 Introduction to Operating Systems Assignment 3 (10 points)

Develop a simple Linux kernel module that runs on your virtual machine. The only functionality required of your module is to be able to load and unload, printing a debug message while doing so. When a Linux kernel module is loaded, it invokes an init function, and when it is removed (or unloaded), it invokes an exit function. Please consult the freely available O'Reilly book "Linux Device Drivers, 3rd Edition" (https://lwn.net/Kernel/LDD3/), in particular p. 16, as well as your textbook p. 96 to get you started. Note that even though the LDD3 book is written for kernel version 2.6, most mechanisms are applicable with minor or no changes. The relevant example code is copied below as a starting point:

#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("Dual BSD/GPL");

static int hello_init(void)
{
    printk(KERN_ALERT "Hello, world\n");
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_ALERT "Goodbye, cruel world\n");
}

module_init(hello_init);
module_exit(hello_exit);

The hello_init() function is invoked when you insert your module (using the insmod shell command), whereas hello_exit() is called when you unload your module (using the rmmod shell command). In addition, you may also need to slightly modify the Makefile provided in the book to suit your setup.

Modify this module such that:
1) The init function prints the tick time in milliseconds (i.e. the timer interval, as we defined it in weeks 1/2) after the hello message.
2) The exit function prints a goodbye message and the time between the insertion and removal of the module (i.e. between the init and exit functions) using two different methods:
a. Using the difference in the value of jiffies from inserting the module to removing the module (Hint: search for "jiffies" and "HZ" in the O'Reilly book)
b.
Using the time difference obtained by reading the timer (Hint: use ktime_get_boottime(); more documentation may be found at https://www.kernel.org/doc/html/latest/core-api/timekeeping.html).

Hints:
• Your module should use printk() to print messages. You will use this print facility to also debug your code if needed. More information may be found at https://www.kernel.org/doc/html/latest/core-api/printk-basics.html
• Unless the message level you pass to printk() is of higher priority (i.e. lower value) than the console's log level (i.e. threshold), printk() will print to a log buffer instead of your shell's console.
  o In such a case, you shall use the dmesg shell command to view messages printed by printk(), e.g.: dmesg
  o You may clear the log using: dmesg -C
• You may modify the printk() behavior by either changing the message log level (i.e. the first byte you pass to printk()) OR by changing the console's threshold using: dmesg -n 8 (prints everything to the console)
• You may use the Makefile provided in the O'Reilly book, but you may need to install the kernel headers prior to using it if not already installed: sudo apt-get install linux-headers-$(uname -r)

Submission file structure: Please submit a single .zip file named [Your Netid]_lab#.zip. It shall have the following structure (replace # with the actual assignment number):

└── [Your Netid] hw# (single folder that includes all your submissions)
    ├── lab#_1.c (source code for problem 1)
    ├── lab#_2a.c (source code for problem 2a, and so on)
    ├── lab#_1.h (source code header file, if any)
    ├── Makefile (makefile used to build your program, if any)
    ├── lab#.pdf (images + report/answers to short-answer questions)

What to hand in (using Brightspace):
• Source files (.c or .h) with appropriate comments.
• Your Makefile, if any.
• A .pdf file named "lab#.pdf" (# is replaced by the assignment number), containing:
  o Screen shot(s) of your terminal window showing the current directory, the command used to compile your program, the command used to run your program and the output of your program.

RULES:
• You shall use kernel version 4.x.x or above. You shall not use kernel version 3.x.x.
• You may consult with other students about GENERAL concepts or methods, but copying code (or code fragments) or algorithms is NOT ALLOWED and is considered cheating (whether copied from other students, the internet or any other source).
• If you are having trouble, please ask your teaching assistant for help.
• You must submit your assignment prior to the deadline.


[SOLVED] CS-UG 3224 Introduction to Operating Systems Assignment 2 (10 points)

Develop a C program, "mycopy", whose main routine accepts two input parameters from the user: an input file name and an output file name (both passed to your program when it is invoked from the shell). Both files are text files. Your program shall then create an output file and print the user's full name, followed by the user's ID (an integer), followed by a newline, and then copy the contents from the input file into the output file. It is okay to hardcode your name as a string literal in your code, but for the user ID, you shall use a system call that gets it for you. Use Unix calls for opening, closing, reading and writing files, not the standard C calls. Below is an example of how your program should be invoked from the shell:

mycopy input.txt output.txt

where input.txt is a file that already exists (you may create an input file with a few lines to test your code). If a path is not provided in the filenames, then it is assumed that a file is located in the same directory as the working directory of your program, i.e. the directory where your program was invoked from.

After developing your program, invoke it using strace and then answer the following questions:
1) What are the system call names for getting the process' user ID, opening a file, closing a file, reading a file and writing a file?
2) How many system calls (i.e. the count) are involved with opening a file, closing a file, reading a file and writing a file? (Count each individually. You may either use strace options to aid you in doing so, or you may use grep.)
3) What was the value of the file descriptor of your read file? Should we expect it to change if you change the order of opening the input and output files?
4) What was the value of the file descriptor of your write file? Should we expect it to change if you change the order of opening the input and output files?

Notes and hints:
• Please include your answers and the strace log in your submitted .pdf file.
• Create a text file and use it to test your program, e.g. type:
touch input.txt
echo "Hello world" > input.txt
echo "This is lab 2" >> input.txt
• Use the man pages to learn how to use POSIX API library functions (and the necessary include files) and/or UNIX commands and their various optional arguments (e.g. strace, especially for counting), e.g.:
man strace // gets info from section 1, the user's manual
man 2 getuid // section 2 is the programmer's manual

Submission file structure: Please submit a single .zip file named [Your Netid]_lab#.zip. It shall have the following structure (replace # with the actual assignment number):

└── [Your Netid] hw# (single folder that includes all your submissions)
    ├── lab#_1.c (source code for problem 1)
    ├── lab#_2a.c (source code for problem 2a, and so on)
    ├── lab#_1.h (source code header file, if any)
    ├── Makefile (makefile used to build your program, if any)
    ├── lab#.pdf (images + report/answers to short-answer questions)

What to hand in (using Brightspace):
• Source files (.c or .h) with appropriate comments.
• Your Makefile, if any.
• A .pdf file named "lab#.pdf" (# is replaced by the assignment number), containing:
  o Screen shot(s) of your terminal window showing the current directory, the command used to compile your program, the command used to run your program and the output of your program.

RULES:
• You shall use kernel version 4.x.x or above. You shall not use kernel version 3.x.x.
• You may consult with other students about GENERAL concepts or methods, but copying code (or code fragments) or algorithms is NOT ALLOWED and is considered cheating (whether copied from other students, the internet or any other source).
• If you are having trouble, please ask your teaching assistant for help.
• You must submit your assignment prior to the deadline.
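The structure of mycopy can be prototyped at the descriptor level. The sketch below is Python (the assignment itself must be in C with the Unix open/read/write/close calls), using os.open/os.read/os.write, which wrap the same descriptor-based calls strace will show; "Jane Student" is a placeholder for the hardcoded full name:

```python
import os

def mycopy(src, dst):
    # mirrors the Unix-level open/read/write/close calls the C program must use;
    # "Jane Student" is a hypothetical placeholder for the hardcoded full name
    header = f"Jane Student {os.getuid()}\n".encode()
    fd_in = os.open(src, os.O_RDONLY)
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd_out, header)                    # name, user ID, newline
    while chunk := os.read(fd_in, 4096):        # copy until read() returns empty
        os.write(fd_out, chunk)
    os.close(fd_in)
    os.close(fd_out)
```

Running the C equivalent under strace and grepping for the open/read/write/close calls is how you would answer the counting questions above.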


[SOLVED] Introduction to Operating Systems (CS-UG 3224) Assignment 1 (5 points)

In this assignment, you are required to download and install the latest VMware Workstation Player from www.vmware.com and create a new virtual machine after downloading the latest Ubuntu Linux distribution from www.ubuntu.com. If you already have an Ubuntu Linux machine, then you may use it; however, it is recommended that you use a virtual machine for assignments that pertain to developing kernel modules. Please note that we will develop Linux kernel modules in this class, and as such Mac OS will not do (besides, it behaves differently from Linux when used with pthreads).

After successfully installing and running Linux, use one of the pre-installed editors (e.g. vi, gedit, emacs, etc.), or download an editor of your choice, to write a C program that prints the text "Hello world! This is CS3224, Fall 2024!" on the first line, and then, on the next line, prints the student's first/last names, followed by a random number whose value is between 0-199 (please ensure you seed your rand() properly, e.g. use srand(time(NULL)) to seed). Your program shall then print a newline character and then exit. You should use gcc for compiling your program. You should name your output file (i.e. the executable) lab1 (yes, no extension).

Below are the links for the free versions (for student use) of VMware Workstation (Windows/Linux) and Fusion (Mac):
https://www.vmware.com/products/workstation-player/workstation-player-evaluation.html
https://customerconnect.vmware.com/web/vmware/evalcenter?p=fusion-player-personal

Submission file structure: Please submit a single .zip file named [Your Netid]_lab#.zip.
It shall have the following structure (replace # with the actual assignment number):

└── [Your Netid] hw# (single folder that includes all your submissions)
    ├── lab#_1.c (source code for problem 1)
    ├── lab#_2a.c (source code for problem 2a, and so on)
    ├── lab#_1.h (source code header file, if any)
    ├── Makefile (makefile used to build your program, if any)
    ├── lab#.pdf (images + report/answers to short-answer questions)

What to hand in (using Brightspace):
• Source files (.c or .h) with appropriate comments.
• Your Makefile, if any.
• A .pdf file named "lab#.pdf" (# is replaced by the assignment number), containing:
  o Screen shot(s) of your terminal window showing the current directory, the command used to compile your program, the command used to run your program and the output of your program.

RULES:
• You shall use kernel version 4.x.x or above. You shall not use kernel version 3.x.x.
• You may consult with other students about GENERAL concepts or methods, but copying code (or code fragments) or algorithms is NOT ALLOWED and is considered cheating (whether copied from other students, the internet or any other source).
• If you are having trouble, please ask your teaching assistant for help.
• You must submit your assignment prior to the deadline.
