Assignment catalog

33,401 assignments available

[SOLVED] Cs382 –

CS 382 Computer Architecture

1 Task: Calling Convention

In this task you will practice the calling convention and creating procedures. Things can get very tricky in this lab, so be careful! You will write two procedures in assembly, _uppercase() and _toupper():

• char _uppercase(char lower); receives a single lowercase character, converts it to upper case, and returns the new character;
• int _toupper(char* string); receives the address of a string, converts all characters to upper case by calling _uppercase(), and returns the number of characters converted. Note that this function converts the characters in place: it replaces the lowercase characters in the original string with uppercase ones instead of creating a new string.

Using the data in the starter code, the program should print the following:

You can assume the source string always contains alphabetic lowercase letters and nothing else.

Be Careful…

When loading a character, or a byte, into a register, the instruction is LDRB or LDRSB, and the destination register is Wt, not Xt. The same applies when storing a character back: the instruction is STRB, again with a Wt register (AArch64 has no STRSB, because a store needs no sign extension).

Requirements

• Your code is a complete assembly program (not just a sequence of instructions). It should assemble, link, and execute without errors or warnings. When executed, the program must finish without problems. If your code cannot assemble, link, and/or execute, you get no credit;
• You must create procedures correctly, meaning they follow the calling convention discussed in the textbook and in class;
• You must not use any C library functions other than printf();
• When using printf(), you must use outstr as defined in the starter code, without changing it;
• You must not hard code any length-related variables;
• You do not need to write comments on every line, but you are strongly encouraged to do so.

2 Grading

The task can be very easy without procedure calls, but getting familiar with procedures and calling conventions is important and is exactly the purpose of this lab. We treat the proper creation of procedures as equally important as getting the correct output on screen.

The lab will be graded based on a total of 10 points. The following lists the deductions; the lowest score is 0, so there are no negative scores:

• the code does not assemble, or the program terminates abnormally/unsuccessfully;
• the code is generated by a compiler or AI;
• the code cannot be explained clearly in person;
• did not call _uppercase() in _toupper() and/or _toupper() in the main procedure;
• used external functions other than printf();
• -5: did not call printf() in the main procedure to print the expected result;
• -5: the procedure was not created properly and/or did not follow the calling convention (-5 points for each of the two procedures);
• -5: the procedure did not return any value (-5 points for each of the two procedures);
• -5: declared/hardcoded any data that represents the length of the string;
• -3: the string is incorrectly modified or not modified at all;
• -2: the return value is wrong;
• -1: no pledge and/or name.

Attendance: check off at the end of the lab to get attendance credit.
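As a reference for the intended behaviour only (the lab itself must be written in assembly), the two procedures can be sketched in C. The names uppercase and to_upper_in_place are hypothetical stand-ins for _uppercase() and _toupper(), and the sketch relies on the handout's assumption that the string holds only lowercase ASCII letters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C counterpart of _uppercase(): assumes `lower` is in 'a'..'z'. */
char uppercase(char lower) {
    /* In ASCII, each lowercase letter sits 32 above its uppercase form. */
    return (char)(lower - ('a' - 'A'));
}

/* Hypothetical C counterpart of _toupper(): converts the string in place
 * and returns how many characters were converted. The length is never
 * declared anywhere; the loop stops at the terminating '\0'. */
int to_upper_in_place(char *string) {
    assert(string != NULL);
    int count = 0;
    for (char *p = string; *p != '\0'; p++) {
        *p = uppercase(*p);   /* one procedure call per character */
        count++;
    }
    return count;
}
```

In the assembly version, the call inside the loop is where the calling convention matters: the argument and the return value travel in W0, and any caller-saved register holding the string pointer or the counter must be preserved across the call.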


[SOLVED] Cs382 –

CS 382 Computer Architecture

1 Task 1: Calculating Dot Product

The .data segment must be declared as follows:

where vec1 and vec2 are two vectors, and dot is where we store the dot product result. You must store the dot product into variable dot. There is no need to use loops; you can just hard code the offsets for now. You can always assume the vector length is 3.

Requirements

• Your code is a complete assembly program (not just a sequence of instructions). It should assemble, link, and execute without errors or warnings. When executed, the program should finish without problems (and without any output);
• If your code cannot assemble, you get no credit; this is the same as a C program that cannot be compiled;
• The MUL instruction can be used for multiplications;
• Avoid using registers X29 and X30;
• You must store the dot product result into the variable dot;
• You have to put a comment on each line of instruction.

2 Task 2: Debugging Assembly Using gdb

To check whether our programs are correct, we have to rely on gdb (sorry, still no printf() yet!). A very comprehensive tutorial on using gdb to debug assembly programs is in Appendix B.3 of the textbook. Read through the section before you start this task.

In this task, you need to write a report on using gdb to debug task 1. You need to provide sufficient screenshots of gdb to show that your program is correct. Step into gdb and use commands to show that the result is correct.

Requirements

• A single screenshot showing the final result is not sufficient. For each step you took and each command you typed in gdb, you need a screenshot and an explanation of what you are trying to accomplish at that step. For example, setting a breakpoint needs one; stepping into an instruction needs one; and so on;
• You must use the correct command to show directly that the dot-product calculation is correct and is stored back to memory;
• The screenshots must not be pictures taken with your phone or camera.

3 Grading

The lab will be graded based on a total of 10 points, 5 for task 1 and 5 for task 2. The following lists the deductions; the lowest score is 0, so there are no negative scores:

Task 1:
• -3: the calculated dot product is wrong;
• -3: the calculated dot product is not in the dot variable;
• -1: one or more instructions are missing comments;
• -1: the program produces any output on the terminal when executing;
• -1: no pledge and/or name.

Task 2:
• -5: the report is not in PDF format;
• -2: the screenshots are not taken directly from the laptop;
• -2: missing screenshot and/or explanation of one or more debugging steps;
• -2: not showing the final value of the dot-product calculation in memory in gdb;
• -1: no pledge and/or name in the report.

General deductions (only deducted once):
• -10: the code in task 1 does not assemble, or the program terminates abnormally/unsuccessfully; does not attempt the task; is generated by a compiler; cannot be explained clearly in person.

Attendance: check off at the end of the lab to get attendance credit.
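The computation in Task 1 can be sketched in C to make the hard-coded offsets explicit. This is only an illustration: the function name dot3 and the 64-bit element type are assumptions, since the required .data declaration is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two length-3 vectors with no loop, mirroring the
 * "hard code the offsets" approach: elements 0, 1 and 2 correspond to
 * byte offsets 0, 8 and 16 for 64-bit elements (an assumed element size). */
int64_t dot3(const int64_t vec1[3], const int64_t vec2[3]) {
    assert(vec1 != NULL && vec2 != NULL);
    int64_t dot = vec1[0] * vec2[0];   /* first product             */
    dot += vec1[1] * vec2[1];          /* add the second product    */
    dot += vec1[2] * vec2[2];          /* add the third product     */
    return dot;                        /* the lab stores this in dot */
}
```

For example, with the made-up vectors {1, 2, 3} and {4, 5, 6} the result is 4 + 10 + 18 = 32. In the assembly version each line above becomes a pair of loads, a MUL, and an ADD, followed by a final store to dot.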


[SOLVED] Cs382 – lab 4: gdb

By: CS 382 CAs

Starting GDB with your program
● Install the gdb-multiarch package:
  ○ sudo apt-get install gdb-multiarch
● Assemble using the -g flag and then link
  ○ aarch64-linux-gnu-as demo.s -g -o demo.o
  ○ aarch64-linux-gnu-ld demo.o
● Run your program and wait for GDB to connect using the `-g 1234` flag
  ○ qemu-aarch64 -g 1234 a.out
● In another terminal window, run gdb and connect to the program
  ○ gdb-multiarch -nh -q a.out -ex 'set disassemble-next-line on' -ex 'target remote :1234' -ex 'set solib-search-path /usr/aarch64-linux-gnu-lib/' -ex 'layout regs'

Note: You can find these steps in section B.3.2 of the textbook; backslashes in the final command simply denote new lines.

Interacting with GDB
● Please read section B.3.3 in the textbook (p.184)
● Breakpoints
  ○ Use b to pause the program when it reaches the label
  ○ Ex: b _start to pause at the start of the program
● Moving through the program
  ○ Resume execution using continue or c
  ○ Step through the program using step or s
● Panel focus
  ○ Use focus regs to view the values of the registers
  ○ Use focus asm to go back to the assembly code panel

Printing Memory
● Read section B.3.4 in the textbook
● To print data stored in memory we use the following command:
  ○ x/ address
● If we wanted to print 5 bytes as characters from the label hello:
  ○ x/5cb &hello
● To print 2 bytes in decimal from the address stored in x10:
  ○ x/2db $x10

Starter Code

Task Time
● Come up for attendance before leaving
● Task 1: Calculating Dot Product
  ○ Use the data in "vec1" and "vec2" to calculate the dot product and store it in "dot"
  ○ Must be able to assemble, link, and execute without error
  ○ You are allowed to use the MUL instruction as needed
  ○ Comment every line
● Task 2: Debugging using GDB
  ○ Look at appendix B.3
  ○ Write a report about your program from Task 1


[SOLVED] Cs382 – name: breona pizzuta

Partner (if any): Ben Carpenter
Pledge: “I pledge my honor that I have abided by the Stevens Honor System.”

CS 382 Lab 4 Task 2

Start by using the b _start command:
first 19 in first
and add the product so far
next 2 in 2nd
of the second vector and add the product so far
next 2 in 3rd
of the third vector and add the product so far
until line 40: Use x/1dg &dot: The result of the dot product should be stored here. We see it prints 140, which is the dot product.
to end of program: Ends program


[SOLVED] Cs382 –

CS 382 Computer Architecture

In this lab, we're going to write and execute assembly programs for the first time.

1 Task 1: Install QEMU Emulator

Recall that assembly programs are closely tied to the hardware platform they run on, so an assembly program written for ARM cannot be executed directly on other kinds of machines. Most of the time, however, we also want to execute ARM assembly on some other machine. In this case, we can install an emulator that simulates the hardware execution of an ARM machine on any type of machine. Assembling ARM code on a non-ARM machine is called cross-compilation; the emulator is what lets us run the result. The emulator we are using is called QEMU. To install QEMU on your virtual machine, type the following in your terminal:

Once you have finished the installation, you need to write the simplest ARM assembly program (the one listed in B.1.1 in the textbook), and assemble and execute it. Upon success, nothing should be printed: no warnings or output of any kind. To learn how to assemble, link, and execute an assembly program, read B.2.1 in the textbook.

To get credit for attendance, show your CA that you have installed QEMU and can assemble/link/execute the simplest ARM program. No submission is needed.

2 Task 2: Write a Simple Program

One very common thing we do in programming is to print something to the screen. Although printing numbers takes a bit more work, printing a string is much easier. In this task, you will learn how to declare a string and how to invoke a system call to print the string out. The code listed in B.1.1 in the textbook is your starter code. Please learn how to declare data and load addresses by reading B.1.3 in the textbook before starting this task.

Unlike in high-level languages, if we want to print a string in assembly, we have to declare it first in the .data segment:

To print this string out, we need to set several registers to the correct values before invoking the system call, because the system retrieves the information about the string from these registers:

Register   Content
X0         Destination (for printing, its value is 1)
X1         The address of the string to be printed
X2         The length of the string to be printed
X8         System call number (for printing, its value is 64)

Once these registers are ready, we can invoke the system call with the instruction SVC 0, and the string will be printed out!

Lastly, to terminate the program successfully, we need the following instructions:

Requirements

• Your code is a complete assembly program (not just a sequence of instructions). It must assemble, link, and execute without errors or warnings. When executed, the program must finish without problems;
• If your code cannot assemble, you get no credit; this is the same as a C program that cannot be compiled;
• You must declare the length of the string as a quadword in the .data segment;
• You must put comments on every instruction you wrote; there is no need to comment on labels and directives.

3 Grading

Task 2 will be graded based on a total of 10 points. The following lists the deductions; the lowest score is 0, so there are no negative scores:

• the code does not assemble, or the program terminates abnormally/unsuccessfully;
• the code is generated by a compiler;
• -2: the length of the string is not declared as quad data in the .data segment;
• -2: one or more instructions are missing comments;
• -1: no pledge and/or name.

Attendance: show your CA the completed task 1 and check off at the end of the lab to get attendance credit.
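On AArch64 Linux, system call number 64 is write, so the register setup described in Task 2 corresponds to a single write() call in C. The sketch below is only an analogy for what the registers mean; the function name print_string is hypothetical, and the lab itself must be done in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Each argument maps onto one register from the table in Task 2:
 *   X0 = destination (1 is standard output, STDOUT_FILENO in C)
 *   X1 = address of the string
 *   X2 = length of the string in bytes
 * X8 = 64 selects the write system call, which the C library hides. */
ssize_t print_string(const char *msg, size_t length) {
    assert(msg != NULL);
    return write(STDOUT_FILENO, msg, length);
}
```

Calling print_string("Hello\n", 6) prints the string and returns the number of bytes written. Note that the length is passed explicitly, just as the lab requires it to be declared in the .data segment rather than discovered by the system.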


[SOLVED] Cs382 –

CS 382 Computer Architecture

1 Task: An Array of Nibbles

In this lab, we're going to create an array of nibbles. Say we have an integer array: int arr[2] = {0xEFF2, 0x9812}. We can use bit-wise operations and shifting to create an array like this: unsigned char nibs[16] = {0,0,0,0,0xE,0xF,0xF,0x2,0,0,0,0,0x9,0x8,0x1,0x2}. Note how many zeros there are in nibs: each integer takes four bytes, which is eight nibbles, so you need to make sure leading zeros are also included in the array. Since there is no data type that contains only four bits, we use unsigned char as a substitute.

The function you are going to implement is declared as follows:

where intarr is the integer array, nint the number of integers in that array, nibarr the array of nibbles that you are going to fill in, and nnibs the size of that array. You can assume both nint and nnibs are correct.

Requirements

Your code must compile successfully and execute without a segmentation fault or any other type of error; as comments; You must not change the function declaration of int_to_nibble().

2 Grading

The lab will be graded based on a total of 10 points.

• the code does not compile, or executes with a run-time error;
• -5: included other header files, and/or the starter code was changed (except main());
• -5: the result is incorrect;
• -3: leading zeros in the integer numbers are not stored as nibbles;
• -1: no pledge and/or name(s) in C file.

Attendance: check off at the end of the lab to get attendance credit.
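One way to approach the task is sketched below. The declaration of int_to_nibble() is not reproduced in this listing, so the parameter types and the void return type here are assumptions based only on the parameter names given above; the sketch also assumes a 32-bit int.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed signature: only the parameter names come from the handout.
 * Splits each 32-bit integer into eight nibbles, most significant first,
 * so leading zeros are stored as well. */
void int_to_nibble(const int *intarr, size_t nint,
                   unsigned char *nibarr, size_t nnibs) {
    assert(nnibs >= nint * 8);   /* eight nibbles per 32-bit integer */
    for (size_t i = 0; i < nint; i++) {
        /* Work on an unsigned copy so right shifts never sign-extend. */
        unsigned int value = (unsigned int)intarr[i];
        for (int k = 0; k < 8; k++) {
            int shift = 28 - 4 * k;   /* 28, 24, ..., 4, 0 */
            nibarr[i * 8 + k] = (unsigned char)((value >> shift) & 0xF);
        }
    }
}
```

Shifting right by 28 first and by 0 last places the most significant nibble at the lowest index, which matches the order of the nibs example in the task description.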


[SOLVED] Cs382 –

CS 382 Computer Architecture

1 Task: Display a Binary Integer

In this lab, we're going to manipulate binary numbers in a C program. More specifically, we're going to write a C program that can print out the binary pattern of any 32-bit integer. We have provided starter code for you:

where display_32() is the function you need to complete. Here we use int32_t and int8_t as substitutes algorithm would be extracting every bit of the number using bit-wise operations and shifting, while calling display() to print out one bit. An example of output from the code above would be:

Notice two things:
• You need to output all 32 bits, including leading zeros;
• The MSB is the leftmost bit and the LSB the rightmost, so you need to print the MSB first and the LSB last.

Requirements

Your code must compile successfully and execute without a segmentation fault or any other type of error; own tests;
• You must not use division or multiplication in any part of your code (addition and subtraction are allowed, though); only use shifting ( << and >> ) and bit-wise operators ( & and/or | ) to extract individual bits;
• You can create any functions that help you, but you must call the display() and display_32() functions. Also, you should not need to include any more header files;
• All 32 bits must be printed out; the MSB is the leftmost bit, and the LSB the rightmost.

2 Grading

The lab will be graded based on a total of 10 points.

• the code does not compile, or executes with a run-time error;
• if multiplication and/or division and/or modulo operators are used;
• -5: display() and/or display_32() are not used;
• -5: included other header files, and/or the starter code was changed (except main());
• -5: no display of the binary number and/or the result is incorrect;
• -3: negative numbers are not displayed correctly;
• -3: leading zeros are not printed out;
• -3: the binary number is printed in reverse order (i.e., the MSB is the rightmost);
• -1: no pledge and/or name in C file.

Attendance: check off at the end of the lab to get attendance credit.
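The bit-extraction idea can be sketched as follows. The starter code is not reproduced in this listing, so the display() body here is a stand-in, the signatures are assumptions, and bit_at() is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns bit `pos` (0 = LSB, 31 = MSB) of num.
 * The cast to uint32_t makes the right shift logical, so negative
 * numbers are handled without sign-extension surprises. */
int8_t bit_at(int32_t num, int pos) {
    assert(pos >= 0 && pos < 32);
    return (int8_t)(((uint32_t)num >> pos) & 1u);
}

/* Stand-in for the starter code's display(): prints a single bit. */
void display(int8_t bit) {
    putchar(bit ? '1' : '0');
}

/* Prints all 32 bits, MSB first, using only shifts and bit-wise AND. */
void display_32(int32_t num) {
    for (int pos = 31; pos >= 0; pos--) {
        display(bit_at(num, pos));
    }
    putchar('\n');
}
```

Counting pos down from 31 to 0 is what puts the MSB on the left and guarantees that leading zeros are printed, since the loop always runs 32 times regardless of the value.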


[SOLVED] Cs382 –

CS 382 Computer Architecture

Read Before Start

For each of the tasks in this homework, you are provided two files. One is _data.s, where the .data and .bss segments are stored. You are free to change the values of the variables declared there, but you must not change the label names, and you must not add any new data or instructions there. The other one is .s, where the .text segment is provided along with some starter code. Only write assembly instructions at the specified place, and do not modify any existing code there. If you need to add new data to your program, feel free to declare another .data segment at the bottom of this file (not the _data.s file!).

When you are asked to print something out, do not use printf() or any functions from stdio.h, as the tester will not be able to capture your output, and all test cases will fail.

To test your code on your own, we take task 1 as an example, where copystr.s and copystr_data.s are provided:

We also provide a tester file that can run multiple tests on your assembly program. This is also the tester we are going to use when grading your homework. Note that any violation of the conditions mentioned above will likely crash the tester program, so please do follow the instructions.

1 Task 1 (20 pts): Copy a String (Again)

In this task, you will write assembly code that completes the same task as in the previous homework, i.e., copies the string src_str to another string dst_str. You can assume dst_str is always large enough to store all characters copied there. After copying the string, please use a system call to print the string dst_str to the terminal.

Requirements

• You must use loops or recursion. If you want to use recursion, you must follow calling conventions and manage stack frames;
• You must not declare or hardcode variables that represent string length;
• You must not use any external libraries or functions;
• You must use a system call to print, not printf();
• Write your name and pledge at the top of the code.

2 Task 2 (50 pts): Binary Search

In this task, you'll implement a binary search algorithm in ARM assembly. An example .data segment is provided to you in bins_data.s, which includes a double word array (sorted), the length of the array, the target value we want to find, and output messages. You can assume the numbers are signed and the array is already sorted. Again, if you need to declare additional data, you must add another .data segment inside the file bins.s, not bins_data.s.

After the search, you need to print the messages correctly with the target value using system calls. For example,

You need to make sure the code exits successfully without any errors after printing the messages.

Requirements

• Your algorithm must be binary search, of course;
• You must not hard code the array length in your code, so you should always use length in the .data segment;
• You must use loops or recursion. If you want to use recursion, you must follow calling conventions and manage stack frames;
• You must not use any external libraries or functions;
• You must use a system call to print, not printf();
• Write your name and pledge at the top of the code.

3 Task 3 (30 pts): Converting String to Integer

In this task, you'll write assembly code to convert a string to an integer. For example, say a string is declared in the .data segment:

Then your program will convert the string into the integer 382 and store it to number. You don't need to consider negative numbers. Just a refresher: if the string is “9082”, the number can be calculated as 9×10^3 + 0×10^2 + 8×10^1 + 2×10^0.

Be Careful…

• The characters in a string are stored as their ASCII values, not the actual digits;
• When loading a character, or a byte, into a register, the instruction is LDRB or LDRSB, and the destination register is Wt, not Xt.

Requirements

• You must use loops or recursion. If you want to use recursion, you must follow calling conventions and manage stack frames;
• You must not assume the length of the string numstr, so you must not declare or hardcode any variable representing the string length in .data or .text;
• You must store the converted integer into the variable number;
• You must not use any external libraries or functions;
• You must use a system call to print, not printf();
• Write your name and pledge at the top of the code.

4 Starter Code & Tester

To help you with testing, we provide a tester file, tester. Put this tester file in the same directory as your assembly code, and run the tester:

5 Grading

The homework will be graded based on a total of 100 points.

• Task 1 (20 pts): 5 test cases in total, 4 points each;
• Task 2 (50 pts): 20 test cases in total, 2.5 points each;
• Task 3 (30 pts): 10 test cases in total, 3 points each.

After accumulating points from the testing above, we will inspect your code and apply the deductions listed below. The lowest score is 0, so there are no negative scores:

Task 1 (20 pts):
• -20: the code does not assemble, or executes with a run-time error;
• -20: the code is generated by a compiler;
• -20: no loop/recursion;
• -20: used any external libraries and/or functions (e.g., printf());
• -15: not managing stack frames and/or not following calling conventions if using recursion;
• -15: declared/hardcoded string length;
• -5: no pledge and/or name in assembly file;

Task 2 (50 pts):
• -50: the code does not assemble, or executes with a run-time error;
• -50: the code is generated by a compiler;
• -50: no loop/recursion;
• -45: the algorithm is not binary search;
• -40: not managing stack frames and/or not following calling conventions if using recursion;
• -30: used any external libraries and/or functions (e.g., printf());
• -5: no pledge and/or name in assembly file;

Task 3 (30 pts):
• -30: the code does not assemble, or executes with a run-time error;
• -30: the code is generated by a compiler;
• -30: no loop/recursion;
• -30: used any external libraries and/or functions (e.g., printf());
• -30: the converted number is not stored in memory;
• -20: not managing stack frames and/or not following calling conventions if using recursion;
• -20: declared/hardcoded any data that represents the length of the string;
• -5: no pledge and/or name in assembly file.
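The algorithms behind Tasks 2 and 3 can be sketched in C before translating them to assembly. This is an illustration under assumptions: the function names are hypothetical, the array elements are taken to be signed 64-bit values, and returning -1 for "not found" is a choice made here rather than something the handout specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Iterative binary search over a sorted array of signed 64-bit values.
 * Returns the index of target, or -1 if it is absent (assumed convention). */
long binary_search(const int64_t *arr, long length, int64_t target) {
    assert(arr != NULL);
    long lo = 0, hi = length - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;   /* a right shift by 1 in assembly */
        if (arr[mid] == target) return mid;
        if (arr[mid] < target) lo = mid + 1;   /* signed comparison */
        else hi = mid - 1;
    }
    return -1;
}

/* Converts a string of decimal digits to an integer without knowing its
 * length: the loop stops at '\0'. Each step multiplies the running total
 * by 10 and adds the next digit, which evaluates the positional sum
 * 9*10^3 + 0*10^2 + 8*10^1 + 2*10^0 for "9082". */
uint64_t str_to_uint(const char *numstr) {
    assert(numstr != NULL);
    uint64_t number = 0;
    for (const char *p = numstr; *p != '\0'; p++) {
        number = number * 10 + (uint64_t)(*p - '0');   /* ASCII to digit */
    }
    return number;
}
```

The multiply-then-add form avoids computing powers of ten, so the assembly loop needs only one MUL (or MADD) and one subtraction of the ASCII code of '0' per character.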


[SOLVED] Cs3220 lab #5 : a case study of a risc-v with an external

ALU

100 pts in total, will be rescaled into 11.25% of your final score of the course.
Part 1: Connect An External ALU with A RISC-V: 60 pts
Part 2: Performance Optimization: 40 pts + 10 bonus pts
Submission ddl: Nov 6th –> Nov 8th

This lab builds upon the knowledge you've gained from previous lectures and labs on RISC-V CPU design, as well as your research into AI accelerator implementations. Specifically, you'll be integrating the RISC-V CPU you designed in earlier labs with an external ALU to enhance its efficiency for certain complex workloads. This is the first in a series of three labs on this topic.

Part 1: Connect An External ALU with A RISC-V (60 points):

In this section, you'll integrate the RISC-V CPU you designed in Lab #2 with a supplied external ALU. Your responsibility is to adjust the RISC-V implementation to accommodate the external ALU's operations and verify that the RISC-V CPU can accurately run the given test cases.

The external ALU has the following specifications:

• OP1 and OP2 are 32-bit inputs that specify the values to be used as operands for the ALU operation. (Floating point numbers in IEEE 754 format)
• OP3 is a 32-bit output that holds the result of the ALU operation. (Floating point numbers in IEEE 754 format)
• ALUOP is a 4-bit input that specifies the ALU operation to be performed. The ALUOP values are as follows:
  0001: MULT
  0010: DIV
• CSR_ALU_OUT (Control/Status Register) is a 3-bit port, driven by the ALU and read by the CPU, that represents the status of the ALU operation. The CSR_ALU_OUT values are as follows:
  CSR_ALU_OUT[0] is a 1-bit output that signals whether the ALU OP1 port is READY/BUSY, i.e., whether the ALU will be able to latch in your inputs (operands and ALUOP)
  CSR_ALU_OUT[1] is a 1-bit output that signals whether the ALU OP2 port is READY/BUSY, i.e., whether the ALU will be able to latch in your inputs (operands and ALUOP)
  CSR_ALU_OUT[2] is a 1-bit output that signals whether the result of the ALU operation is VALID/INVALID (1: VALID; 0: INVALID)
• CSR_ALU_IN is a 3-bit port, driven by the CPU and read by the ALU, that controls the status of the ALU operation. The CSR_ALU_IN values are as follows:
  CSR_ALU_IN[0] is a 1-bit input that signals that the results can be overwritten by the ALU. After reading the output, the CPU should set CSR_ALU_IN[0] to 0, indicating it's safe for the ALU to overwrite the results; otherwise, the ALU will stall the current operation instead of writing the result to OP3.
  CSR_ALU_IN[1] is a 1-bit input that signals that the OP1 fed to the ALU is stable. If it's set to 1, the ALU will latch in the OP1 value; otherwise, the ALU will stall the current operation and wait for OP1 to be stable. It's ignored if the ALU is not ready to accept OP1.
  CSR_ALU_IN[2] is a 1-bit input that signals that the OP2 fed to the ALU is stable.
• The ALUOP needs to be loaded first, and the operands OP1 and OP2 need to be loaded in order.
• The ALU is data driven, i.e., it will start the computation as soon as the operands are loaded, based on the loaded ALUOP.
• There is a potential delay between the loading of the two operands, i.e., the ALU may not be ready to load OP2 when OP1 is loaded.

The ALU is adapted from this implementation:
https://github.com/dawsonjon/fpu
https://dawsonjon.github.io/Chips-2.0/language_reference/interface.html

The specifications on the RISC-V CPU side are as follows:

1. For loading the operands, we will use LW instructions to load the operands from memory, with dst reg ID:
   11110: OP1
   11111: OP2
2. For loading the ALUOP to configure the ALU, we will use LW instructions, with dst reg ID
   11101: ALUOP
3. For reading the result/status from the ALU, we will use SW instructions, with src reg ID
   11011: OP3
   11010: CSR_ALU_OUT
4. Intended instruction sequence:
   load ALUOP
   load OP1, OP2 (OP1 and OP2 need to be loaded in order)
   store OP3

Your tasks are as follows:

1. Integrate the ALU with the RISC-V CPU. You will need to modify the RISC-V CPU to accommodate the ALU's operations.
   * Go over all the TODOs and finish the implementation. (FU_stage.v, de_stage.v)
2. You can assume enough NOPs are inserted to separate loading the operands from storing the results.
   * In other words, you don't need to worry about the stalls needed to handle the ALU's readiness.

To pass this part and earn full credit, implement the integration described above, run your implementation on alutest0.mem, and ensure it passes this testcase.
* You can use ./run_tests.sh part5 to test your implementation.

Part 2: Performance Optimization (40 points + 10 bonus pts)

What if there are no NOPs inserted between the OP1 load and the OP2 load, and the ALU might not be ready to load either OP1 or OP2?

To pass this part and earn full credit, implement the integration described above, run your implementation on alutest1.mem, and ensure it passes this testcase.
* You can use ./run_tests.sh part6 to test your implementation.

Bonus points: When the implementation is instructed to store the ALU's results to memory, it's possible that the ALU is still processing. It's even possible that the instruction to store OP3 is issued before the ALU has finished loading either OP1 or OP2. Modify the part 2 implementation to handle the stalls needed when storing the ALU's results to memory. Your implementation should still work on the testcases in part 1 and part 2.

To pass this part and earn full credit, implement the integration described above, run your implementation on alutest2.mem, and ensure it passes this testcase.
* You can use ./run_tests.sh part7 to test your implementation.

Submission

Provide a zip file containing your source code. Generate the submission.zip file using the command make submit. Avoid creating the zip file manually, to prevent any issues with the autograding script, which could lead to a 30% score deduction.
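The stall conditions that Part 2 and the bonus part ask for can be modelled in C as a quick way to reason about them before writing Verilog. This is a sketch under assumptions, not the lab's required design: the function name is hypothetical, and it assumes that a READY port is signalled by a 1 in CSR_ALU_OUT[0] and CSR_ALU_OUT[1], a polarity the specification above does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of CSR_ALU_OUT as listed in the specification. */
#define CSR_OUT_OP1_READY    (1u << 0)   /* assumed: 1 means READY */
#define CSR_OUT_OP2_READY    (1u << 1)   /* assumed: 1 means READY */
#define CSR_OUT_RESULT_VALID (1u << 2)   /* 1: VALID, 0: INVALID   */

/* Register IDs from the handout (binary 11010 through 11111). */
enum alu_reg {
    REG_CSR_ALU_OUT = 0x1A,   /* 11010 */
    REG_OP3         = 0x1B,   /* 11011 */
    REG_ALUOP       = 0x1D,   /* 11101 */
    REG_OP1         = 0x1E,   /* 11110 */
    REG_OP2         = 0x1F    /* 11111 */
};

/* Returns true when an access to `reg` should stall the pipeline
 * given the current 3-bit CSR_ALU_OUT value. */
bool alu_access_must_stall(enum alu_reg reg, uint8_t csr_alu_out) {
    assert((csr_alu_out & ~0x7u) == 0);   /* only three bits are defined */
    switch (reg) {
    case REG_OP1: return (csr_alu_out & CSR_OUT_OP1_READY) == 0;
    case REG_OP2: return (csr_alu_out & CSR_OUT_OP2_READY) == 0;
    case REG_OP3: return (csr_alu_out & CSR_OUT_RESULT_VALID) == 0;
    default:      return false;   /* ALUOP and status reads never wait here */
    }
}
```

Loads to OP1 or OP2 wait on the matching READY bit (Part 2), while a store of OP3 waits on the VALID bit (bonus part). In hardware the same three comparisons would feed the stall signal of the stage that issues the LW or SW.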


[SOLVED] Cs3220 – lab 4 – ai accelerator case study – dnnbuilder

Objective: The primary aim of this lab is to synthesize the skills and knowledge acquired in previous labs focused on digital design and RTL programming. This will be applied in a practical case study centered on Artificial Intelligence (AI) accelerators.

1. Paper summary:
   1. Paper link: DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
   2. Format for paper summary:
      1. Organized section layout, comprising:
         1. Abstract: Provide a high-level description of the paper's main contributions. (Note: Do not copy the original abstract.)
         2. Motivation: Explain the significance of the techniques introduced in the paper.
         3. Methods: Outline the key technical aspects and methodologies presented in the paper.
         4. Effectiveness: Discuss how the paper's experiments validate the efficacy of the proposed techniques.
         5. Summary: Conclude with an overall assessment of the paper's contributions and impact.
         6. Under 1000 words; figures are welcome but do not directly copy the ones in the paper; show your understanding
   3. Submission: A txt file containing the above contents
2. Code implementation understanding and documentation:
   1. Available Modules
      1. High-Level Modules: These are RTL modules primarily responsible for instantiating and utilizing lower-level modules. They provide a broader view of the overall design architecture.
      2. Low-Level Modules: These RTL modules contain detailed logic implementations, offering a more nuanced understanding of the control scheme.
   2. Module assignment:
      2. Assignment will be listed in a sheet posted in both Canvas and Piazza
   3. Documentation format:
      1. Summary of the code file
      2. Line by line comments of the code (meaningless lines can be skipped, e.g., “begin”, “end”, parentheses…)
      3. Example code documentation
   4. Submission:
      1. All documented code pieces
         1. Summary of the code file in a separate txt file
         2. Line by line in code comments
      2. Do not change the original code file name
   5. The source code is adapted from: https://github.com/IBM/AccDNN
3. Submission Format:
   1. Copy the makefile to your folder containing the paper summary txt and documented code pieces
      1. Double check that the zip file contains the necessary contents
   2. make submit
   3. Rename submission.zip to .zip

1. Paper summary (30 points):
   1. Abstract (6 points)
      Clarity and conciseness: 3 points
      Accurate representation of the paper's main contributions: 3 points
   2. Motivation (6 points)
      Explanation of the paper's significance: 3 points
      Relevance to prior work and existing solutions: 3 points
   3. Methods (6 points)
      Clarity in outlining key technical aspects: 4 points
      Depth of understanding: 2 points
   4. Effectiveness (6 points)
      Discussion of the paper's experimental validation: 4 points
      Critical evaluation of the results: 2 points
   5. Summary (3 points)
      Overall assessment of the paper: 2 points
      Coherence and flow of the summary: 1 point
   6. Formatting and Structure (3 points)
      Adherence to guidelines: 2 points
      Grammar and spelling: 1 point
2. Code documentation (17.5 points / code file; 70 points in total):
   1. Summary of the Code File (7.5 points)
      Clarity and conciseness: 3.5 points
      Accurate representation of the code's main functionalities: 4 points
   2. Line-by-Line Comments (10 points)
      Completeness: Covering all meaningful lines of code: 5 points
      Clarity: Making complex or non-intuitive lines understandable: 5 points

Bonus Points (15 points):

Having gained a foundational understanding of the design and implementation of accelerators from academia, are you interested in exploring how industry-grade accelerators are developed? For this bonus assignment, you will delve into the architecture and codebase of NVIDIA's Deep Learning Accelerator (NVDLA), a leading example of an industry-grade AI accelerator. This assignment is both challenging and open-ended, offering you considerable latitude in your approach. Your primary task is to:

• Choose one or more modules within the NVDLA architecture that interest you.
• Document the code, focusing on its structure, functionality, and any unique features.

NVDLA source code: https://github.com/nvdla/hw/tree/nvdlav1/vmod/nvdla
NVDLA documentation: http://nvdla.org/hw/v1/hwarch.html

Grading:
• Two modules' full documentation at this level will lead to full points
• Your documentation will be scaled with the above level


[SOLVED] Cs3220 lab #2 : branch prediction

100 pts in total, rescaled to 11.25% of your final course score.
Part 1: Baseline Branch Predictor: 50 pts
Part 2: Performance Measurement & Optimization: 50 pts + 10 bonus pts (overflow allowed)
Submission deadline: Feb 17th

Part 1: Baseline Branch Predictor (50 points):
In this part, you’ll implement a baseline branch predictor and a branch target buffer for your RISC-V CPU. Here’s a concise overview of the design:
1. The branch history register (BHR) is 8 bits long. Use PC[9:2] XOR BHR to index a Pattern History Table (PHT).
2. The PHT is composed of 2^8 2-bit counters that make the branch prediction. Each counter is initialized to 1 (weakly not taken).
3. The branch target buffer (BTB) has 16 entries, indexed by PC[5:2]. Each BTB entry has three parts: a valid bit, a tag field, and a target address:
   - the tag field is used to determine whether the current PC in the FE stage is the one recorded in the BTB entry;
   - the valid bit indicates whether the entry holds valid history rather than being unused;
   - the target address is used to predict the branch / jump target.

Summary of the G-share branch prediction algorithm:

FE Stage (fe_stage.v): Both the BTB and the PHT are accessed concurrently in this stage.
1. If there’s a BTB hit, use the PHT outcome to determine the target address for the next instruction fetch: if the outcome is taken, use the BTB target address. If the BTB misses, use PC+4 for the next instruction.
2. The address for the next instruction fetch and the index (PC[9:2] XOR BHR) used in the FE stage are passed to the EX stage for the PHT update.

EX stage (agex_stage.v):
1. Check whether the next instruction fetched in the FE stage is correct; if not, flush the pipeline.
2. If the branch is taken and the next instruction we fetched is not the branch target, flush the pipeline.
3. If the branch is not taken and the next instruction we fetched is not PC+4, flush the pipeline.
4. For branch instructions (bne, beq, jalr, etc.), insert the target address into the BTB, whether taken or not.
5. If the PHT is used for branch prediction in the FE stage, update the PHT using the propagated PHT index (PC[9:2] XOR BHR).
6. Update the BHR.
7. As the BHR and PHT are implemented in the FE stage, forward the relevant signals to the FE stage via from_AGEX_to_FE for the BTB, PHT, and BHR updates described above.

To earn full credit for this part, implement the baseline branch predictor described above and make sure it passes testall.mem and all testcases under part2.

Grading: We will check whether the testcases execute correctly. There won’t be any performance improvement in testall.mem because every branch should be predicted not taken: each branch instruction is executed only once, and a branch predictor only helps when the same instruction is encountered multiple times. This testcase is only intended to verify that your branch predictor does not break the other functionality of the RISC-V processor.

Part 2: Performance Measurement & Optimization (50 points + 10 bonus pts)
1. [40 pts] Evaluate branch prediction accuracy by adding counters to measure it (# of correctly predicted branches / # total branch instructions). Use the towers.mem testcase for this assessment and write your measurement results in a pdf report.
   - Note that jump instructions should be counted as branch instructions here as well.
   - To gain credit for part 2, your baseline predictor should have > 30% accuracy.
2. [10 pts + 10 pts bonus] Improve the performance of your branch predictor on the towers.mem testcase by making design changes: you can explore other BHR hashing functions (e.g. using different bits of PC for the XOR operation), or change the PHT or BTB sizes. Implement at least three different design changes, and present the corresponding performance outcomes in your report. If your modifications result in more than a 5% increase in prediction accuracy compared to the baseline branch predictor, you will earn 10 bonus points.

Submission
Provide a zip file containing your source code for Part 1. Generate the submission.zip file using the command make submit. Avoid manual zip file creation to prevent any issues with the autograding script, which could lead to a 30% score deduction.
Submit a concise PDF report for Part 2 (limited to 2 pages) containing the following information: your performance measurements for the baseline G-share branch predictor and your three variants. Discuss the design parameters that were modified and explain how these changes influenced branch prediction accuracy, either positively or negatively. Do not put this pdf inside the zip file. No need to submit code for Part 2.

FAQ
[Q] I passed testall.mem but failed some testcases under test/part2. What should I do?
[A] Please carefully check whether your when-to-flush logic is correctly implemented in the AGEX stage based on the following criteria: if the branch is taken and the next instruction we fetched is not the branch target, flush the pipeline; if the branch is not taken and the next instruction we fetched is not PC+4, flush the pipeline as well.
[Q] I’m debugging my code. I see that there is an X in the BTB. How is that possible?
[A] The FE stage can have pipeline bubbles. Therefore, the BTB/BHT might be indexed with uninitialized values. Please make sure that when you update the BTB/BHT, only branch instructions/signals (not including X) can change the BTB/BHT values.
[Q] I don’t see performance improvement in testall.mem. Why?
[A] This is expected. Every branch in testall.mem is executed only once and is not taken. For a branch predictor to work, the processor has to see the same branch over and over. Without training, the branch predictor won’t work well.
[Q] Do we insert a BTB entry only for a taken branch or even when it is not taken?
[A] You need to insert a BTB entry even if the branch is not taken, because the same branch might be taken next time.
[Q] If we insert a not-taken branch into the BTB, what will be the target address?
[A] You can still compute the target address as if the branch were taken and insert it in the BTB.
[Q] What if the target in the BTB is wrong?
[A] Just like a branch misprediction, we flush the pipeline and also update the BTB with the correct information.
[Q] With a branch predictor, will the pipeline still have pipeline bubbles?
[A] The pipeline will have bubbles for dependency stalls but not for branch instructions.
[Q] I want to add a new file (bp.v). Can I?
[A] Please do not add new files, as that might break our auto-grading script.
[Q] Do I have to show the performance improvement in order to get full credit for part 1?
[A] No. The performance improvement needs to be demonstrated in part 2 only.
[Q] Are we expected to implement data forwarding in lab 2?
[A] No.
[Q] Let’s say my instruction stream is as follows: BR(1) ADD BR(2). When BR(1) is in EX, it will update the BHR. But BR(2) will be in FE at that time. Which value of BHR should FE use? The old value or the updated value from EX?
[A] This is one of the optimization opportunities, so how you handle this case is up to you. Please remember that the branch predictor is just a predictor and it won’t affect the correctness of the program.
[Q] How do I initialize the PHT to one?
[A] You should explicitly write 1s when it resets.
[Q] I ran tower.mem and my test case failed, unlike the other test cases. Is that expected?
[A] Yes. The tower.mem returns “255”, which does not match the PASS criteria of the simulator. You do not need to worry about it.
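The BTB lookup in the FE stage and the flush check in the EX stage can be modeled in a few lines of software before writing any Verilog. The following is a minimal behavioral sketch in Python, not the lab's reference implementation. The class and function names are hypothetical, the tag is assumed to be the PC bits above the index (PC >> 6), and a BTB hit with a not-taken prediction is assumed to fetch PC+4.

```python
from dataclasses import dataclass

BTB_ENTRIES = 16  # the lab text specifies 16 entries indexed by PC[5:2]


@dataclass
class BTBEntry:
    valid: bool = False  # entry holds real history
    tag: int = 0         # assumption: PC bits above the index (PC >> 6)
    target: int = 0      # predicted branch / jump target


class BTB:
    def __init__(self):
        self.entries = [BTBEntry() for _ in range(BTB_ENTRIES)]

    @staticmethod
    def index(pc):
        return (pc >> 2) & 0xF  # PC[5:2]

    @staticmethod
    def tag(pc):
        return pc >> 6  # hypothetical tag choice, see lead-in

    def lookup(self, pc):
        """Return the stored target on a hit, or None on a miss."""
        entry = self.entries[self.index(pc)]
        if entry.valid and entry.tag == self.tag(pc):
            return entry.target
        return None

    def insert(self, pc, target):
        """EX stage: record the target whether or not the branch was taken."""
        self.entries[self.index(pc)] = BTBEntry(True, self.tag(pc), target)


def next_fetch_pc(btb, pc, pht_predicts_taken):
    """FE stage: BTB hit and PHT says taken -> BTB target, otherwise PC+4."""
    target = btb.lookup(pc)
    if target is not None and pht_predicts_taken:
        return target
    return pc + 4


def must_flush(pc, fetched_next_pc, taken, target):
    """EX stage: flush when the fetched address is not the resolved one."""
    correct_next = target if taken else pc + 4
    return fetched_next_pc != correct_next
```

On the first encounter of a taken branch at 0x40 with target 0x20, the BTB misses, the FE stage fetches 0x44, and `must_flush` reports a flush. After `insert`, the same branch with a taken prediction fetches 0x20 directly. A PC such as 0x80 maps to the same index as 0x40 but carries a different tag, so it still misses.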


[SOLVED] Cs3220 lab #3 : a case study of a risc-v with an external ALU

100 pts in total, rescaled to 11.25% of your final course score.
Part 1: Connect An External ALU with A RISC-V: 60 pts
Part 2: Performance Optimization: 40 pts
Bonus: 10 pts
Submission deadline: Mar 3rd

This lab builds upon the knowledge you’ve gained from previous lectures and labs on RISC-V CPU design. Specifically, you’ll integrate the RISC-V CPU you designed in earlier labs with an external ALU to enhance its efficiency for certain complex workloads.

Part 1: Connect An External ALU with A RISC-V (60 points):
In this section, you’ll integrate the RISC-V CPU you designed in Lab #2 with a supplied external ALU. Your responsibility is to adjust the RISC-V implementation to accommodate the external ALU’s operations and verify that the RISC-V CPU can accurately run the given test cases.

The external ALU has the following specifications:
OP1 and OP2 are 32-bit inputs that specify the values to be used as operands for the ALU operation. (Floating point numbers in IEEE 754 format)
OP3 is a 32-bit output that holds the result of the ALU operation. (Floating point numbers in IEEE 754 format)
ALUOP is a 4-bit input that specifies the ALU operation to be performed. The ALUOP values are as follows:
   0001: DIV
   0010: MULT
CSR_ALU_OUT (Control/Status Register) is a 3-bit input port that represents the status of the ALU operation. The CSR_ALU_OUT values are as follows:
   CSR_ALU_OUT[0] is a 1-bit output that signals whether the ALU OP1 port is READY/BUSY, i.e., whether the ALU will be able to latch in your inputs (operands and ALUOP)
   CSR_ALU_OUT[1] is a 1-bit output that signals whether the ALU OP2 port is READY/BUSY, i.e., whether the ALU will be able to latch in your inputs (operands and ALUOP)
   CSR_ALU_OUT[2] is a 1-bit output that signals whether the result of the ALU operation is VALID/INVALID (1: VALID; 0: INVALID)
CSR_ALU_IN is a 3-bit output that controls the status of the ALU operation. The CSR_ALU_IN values are as follows:
   CSR_ALU_IN[0] is a 1-bit input that signals that the results can be overwritten by the ALU. When set to 1, it acknowledges to the ALU that the output has been received and can be overwritten in the following cycles; thus the output will become unstable
   CSR_ALU_IN[1] is a 1-bit input that signals that the OP1 fed to the ALU is stable. If it’s set to 1, the ALU will latch in the OP1 value; otherwise, the ALU will stall the current operation and wait for OP1 to be stable. It’s ignored if the ALU is not ready to accept OP1.
   CSR_ALU_IN[2] is a 1-bit input that signals that the OP2 fed to the ALU is stable
The ALUOP needs to be loaded first, and the operands OP1 and OP2 need to be loaded in order.
The ALU is data driven, i.e., it starts the computation as soon as the operands are loaded, based on the loaded ALUOP.
There may be a delay between loading the two operands, i.e., the ALU may not be ready to load OP2 when OP1 is loaded.
The ALU is adapted from this implementation: https://github.com/dawsonjon/fpu https://dawsonjon.github.io/Chips-2.0/language_reference/interface.html

The specifications on the RISC-V CPU side are as follows:
1. For loading the operands, we will use LW instructions that load the operands from memory, with dst reg ID:
   11110: OP1
   11111: OP2
2. For loading the ALUOP to configure the ALU, we will use LW instructions, with dst reg ID
   11101: ALUOP
3. For reading the result/status from the ALU, we will use SW instructions, with src reg ID
   11011: OP3
   11010: CSR_ALU_OUT
4. Intended instruction sequence:
   load ALUOP
   load OP1, OP2 (OP1 and OP2 need to be loaded in order)
   store OP3

Your tasks are as follows:
1. Integrate the ALU with the RISC-V CPU. You will need to modify the RISC-V CPU to accommodate the ALU’s operations.
   * Go over all the TODOs and finish the implementation. (FU_stage.v, de_stage.v)
2. You can assume that enough NOPs are inserted to separate loading the operands from storing the results.
   * In other words, you don’t need to worry about the stalls needed to handle the ALU’s readiness.
To pass this part and earn full credit, implement the integration described above, run your implementation on alutest0.mem, and ensure it passes this testcase.
   * You can use ./run_tests.sh part5 to test your implementation.

Part 2: Performance Optimization (40 points)
What if there are no NOPs inserted between loading OP1 and loading OP2, and the ALU might not be ready to load either OP1 or OP2? To pass this part and earn full credit, implement the integration described above, run your implementation on alutest1.mem, and ensure it passes this testcase.
   * You can use ./run_tests.sh part6 to test your implementation.

Bonus points (10 points)
When the implementation is instructed to store the ALU’s results to memory, it’s possible that the ALU is still processing. It’s even possible that the instruction to store OP3 is issued before the ALU finishes loading either OP1 or OP2. Modify the part 2 implementation to handle the stalls needed when storing the ALU’s results to memory. Your implementation should still work on the testcases in part 1 and part 2. To pass this part and earn full credit, implement the integration described above, run your implementation on alutest2.mem, and ensure it passes this testcase.
   * You can use ./run_tests.sh part7 to test your implementation.

Submission
Provide a zip file containing your source code. Generate the submission.zip file using the command make submit. Avoid manual zip file creation to prevent any issues with the autograding script. Submit the zip file to gradescope.

Q1. Under the RISC-V specifications, we are given destination and source register ids for using the LW and SW instructions, but I am a bit confused about their usage. For example, to access OP3, do I do SW(11011)? And if so, would this be done in the mem stage?
A1. As the one making the processor, you don’t have to issue any instructions. The details about SW and LW relate to what the programmer does to use the ALU. But it is your job to make sure that if there is a SW instruction with the specified register id, the ALU and CPU are ready to transfer the output to the CPU or have already done so. In this case, the CSR signals are key to coordinating this.

Q2. Are the registers for OP1, OP2, OP3, and ALUOP only ever used for this external ALU? Do we have to handle the case where these registers can be used as general purpose registers as well as registers for this external ALU, or are we guaranteed that these registers are only ever used for external ALU computation?
A2. For this lab, these registers are reserved for the external ALU only.

Q3. I am modifying the DE stage for part 1, and I saw this comment provided to us: “//Recommended states transition: load aluop –> load op1 -> load op2 –> alu processing –> store results to memory”. I understand all the states except for the alu processing. Are we supposed to invoke the external alu here?
A3. The external ALU is able to invoke itself. Once it receives op1 and op2, I believe it moves itself into the execute stage and does the calculation. This takes some amount of time, so the processor has to wait until the result is ready to bring it back to the CPU. Then you’ll have to coordinate between the ALU and CPU to bring the result back. Part 1 and 2 both have enough NOPs between the loads and the store that you won’t run into a case where the SW is issued but the data isn’t ready.

Q4. Decode stage store to memory. How would we store to memory in the decode stage as the TODO says: “//store results to memory;”? I do not see anywhere to access the memory there.
A4. The programmer will issue an SW instruction to store the result in register 27 to somewhere. By the time this instruction reaches DE, register 27 should contain the result from the ALU. If this instruction runs through, you’ve effectively stored the result to memory. The functionality to get the data from the ALU to the CPU in register 27 is for us to implement.

Q5. The readme mentions that CSR_ALU_OUT can be saved to memory by the programmer. Do we assume that whatever value is in the designated register is the correct value, because there don’t appear to be signals that indicate status stability? Do we assume the same thing for ALU_OP? My understanding is that the ALU will only latch the operands when CSR_ALU_IN[1,2] are set, but there is no signal for ALU_OP. Do we assume that it can be loaded at any time?
A5. (1) CSR_ALU_OUT and CSR_ALU_IN are themselves the signals that tell whether the data signals are ready/valid. It’s not necessary to have another set of ready/valid signals for them. So you can more or less assume that whatever value they show is “correct”, but you need to be very careful about when you read these signals. (2) Storing CSR_ALU_OUT is not necessary in THIS lab, so don’t worry about it. (3) For ALU_OP, imagine a switch: the computation will carry on in whichever mode ALU_OP is pointing to at that moment. There is also a default position, so if you do not load it, it will go to the default.

Q6. Confused about how to handle CSR_ALU_IN[0]: when CSR_ALU_IN[0] == 1, that means that the ALU cannot overwrite OP3. My plan is, when wr_reg_WB == OP1 register, we get the OP1 value from the regfile. But if CSR_ALU_IN[0] == 1, we can’t overwrite OP3, so that means I can’t overwrite OP1 or OP2. So what do we do with this value read in from OP1?
A6. On the ALU side, CSR_ALU_IN[0] is used to either keep the unit in the standby state (i.e. it’ll finish the operation and hold the output, but nothing else) or shift the unit into get_a mode. The signal won’t affect you until the next operation, which will block you until you set the CSR_ALU_IN[0] signal to tell the ALU that the CPU has taken the result in. Then it’s on you to set this signal to the proper value to shift the unit into get_a mode, after which op1 can be loaded in as usual.

Q7. In my FU, I am trying to figure out why my part 1 is not working and am curious as to why the FU stage all of a sudden gets a large number such as 3F800000, seemingly at random, despite the numbers I manually provided.
A7. The ALU does floating point operations. Translate the value from HEX into floating point.

Q8. When should CSR_ALU_IN[0] be 0 or 1? Should it initially be 0 or 1? When would we want to set it to 1?
A8. Setting CSR_ALU_IN[0] to 1 brings the ALU back to the start state, while setting it to 0 indicates that we are ready for the answer of the ALU. In other words, set it to 1 to start, and then set it to 0 once we have loaded all the operands.
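Answer A7 is easiest to apply with a small conversion helper. The sketch below uses Python's standard `struct` module to move between the 32-bit hex patterns seen in waveforms and IEEE 754 single-precision values. The helper names and `reference_result` are hypothetical, and treating DIV as OP1 / OP2 is an assumption; the lab text does not state the operand order. Overflow, NaN, and divide-by-zero cases are deliberately not handled.

```python
import struct

ALUOP_DIV = 0b0001   # encodings taken from the lab text
ALUOP_MULT = 0b0010


def hex_to_float(word):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]


def float_to_hex(value):
    """Return the 32-bit pattern of a value rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def reference_result(aluop, op1_bits, op2_bits):
    """Expected OP3 bits for simple cases (assumes DIV means OP1 / OP2)."""
    a = hex_to_float(op1_bits)
    b = hex_to_float(op2_bits)
    if aluop == ALUOP_MULT:
        result = a * b
    elif aluop == ALUOP_DIV:
        result = a / b
    else:
        raise ValueError("unsupported ALUOP in this sketch")
    return float_to_hex(result)


print(hex_to_float(0x3F800000))  # the value from Q7 is simply 1.0
print(hex(reference_result(ALUOP_MULT, 0x40000000, 0x3F000000)))  # 2.0 * 0.5
```

The pattern 0x3F800000 from Q7 decodes to 1.0, so it is an ordinary operand or result rather than a stray large integer.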


[SOLVED] Cs3220 lab #1 : pipeline design

100 pts in total, rescaled to 11.25% of your final course score.
Part 1: 50 pts
Part 2: 50 pts
Part 3 (Optional): 20 bonus pts
Submission deadline: Feb 3

Description: In this assignment, you will create a 5-stage RISC-V pipelined processor using Verilog, focusing on a subset of the RISC-V ISA. We will be using the Tiny RISC-V version from Cornell, which is provided in the Tiny RISC-V ISA file. In part 0, you will familiarize yourself with the essential software tools required for the experiments on the PACE cluster. In part 1, you only need to implement the addi, add, and beq instructions to pass all 5 test cases in test/part1/test[1-5].mem. In part 2, you will extend your processor by adding more instructions in order to pass the test cases under test/part2/. Part 3 is optional for bonus pts, where you will add more instructions.

Part 0: Experiment Setup
Please follow the instructions provided to run experiments on the PACE cluster.
What to submit: No submission is required for Part 0. However, ensure that you know how to visualize waveforms with GTKWave.

Part 1: Minimal functionality
In this part, you’ll implement a subset of RISC-V instructions and are required to pass 5 tests under the test/part1 directory. Refer to the test cases and the README file in test/part1 for detailed requirements.
1. [20pts] Complete the agex_stage.v file. No modifications to other files are necessary. Your implementation should pass test/part1/test[1-5].mem. If not all of the test cases pass, you’ll only receive partial credit. To test all cases together, run run_tests.sh part1, which produces part1_results.log and part1_tests.log for you. You can also run each test case independently; see the FAQ for part 1. Note: If you encounter latch size errors, modify the corresponding latch size definition in define.vh.
2. [10pts] Explain the actions in each pipeline stage while executing test/part1/test1.mem. Include waveform screenshots illustrating relevant signals. For example, in the Execute stage (EX stage), you should visualize and explain the input (regval1_AGEX, regval2_AGEX) and output (aluout_AGEX) signals of the ALU, and the opcode (op_I_AGEX).
3. [10pts] Explain how your RISC-V processor resolves Read-After-Write hazards in test/part1/test2.mem. Include waveform screenshots illustrating the discussed signals.
4. [10pts] Explain how your RISC-V processor handles branch misprediction in test/part1/test4.mem. Include waveform screenshots illustrating relevant signals. Note: In Lab 1, branches are always predicted as not-taken; in Lab 2, you will implement your own branch predictor.
What to submit: Submit the following to Canvas: Include a PDF file containing your explanations and the corresponding screenshots. Start part 2 as early as possible and do not wait until the last week, as it involves a heavier workload than part 1.

Part 2: Extending the instruction set
Test cases: In part 2, we start to use more advanced RISC-V test cases.
*.S is assembly code that uses RISC-V macros. Macros are defined in include/test_macros.h or include/riscv_test.h. It also uses ABI names and pseudo instructions. You can find a summary of information [here].
*.dump is a dump file output by the RISC-V gcc compiler.
*.mem file has the format for Verilog code.
*.dec file is useful when using the [RISC-V emulator]
What to submit: Submit the following to Canvas: Avoid procrastination; start early to manage the workload effectively.

Part 3 (Optional) Complete the processor
1. [20pts] In this part, you will complete the processor to fully support RISC-V RV32I (except CSR instructions). Your goal is to ensure your program passes all the test cases in the test/part3/ directory. To receive full credit, your program must pass test/part3/testall.mem. Partial credit will be awarded based on the coverage of the Part 3 test suites.
What to submit: Submit the following to Canvas:

Useful Information
References
RISC-V RV32I Manual
RISC-V Instruction Card
RISC-V emulator (tiny RV2)
Verilator manual
GTKWave manual
Tutorial of the RISC-V TEST SUITE

FAQ for part 1
(Q) How do I run a specific test file?
(A) Please see “define.vh”: you need to change line 21 to change which test file to read: `define IDMEMINITFILE “/home/zhifan/workspace/cs3220-23fall/lab1/test/part1/test4.mem”. You need to change “test4.mem” into
(Q) Debugging takes so much time. Any tips to reduce the debugging time?
(A) Some suggestions: 1. Review the code carefully and make sure you understand the ISA behavior correctly. 2. If the make command fails to compile, read the error messages carefully. 3. The make command generates a vcd file. Please use GTKWave to view important signals and check whether they behave as expected according to the *.asm files or RISC-V emulators. When debugging, it is always helpful to visualize the clk signal and pc values along with other important signals.
(Q) How do I know whether my implementation is correct or not?
(A) If you run make with a correct implementation, you will see a “Pass” message.
(Q) Can I add new files?
(A) Yes, but please make sure they are added to the zip file.
(Q) Do we need to implement a branch predictor?
(A) It’s not required for lab 1.
(Q) Do we need to create a stack for nested JAL instructions?
(A) The hardware is not aware of any nested function calls, so you do not need to implement it.
(Q) BEQ t1, t1, imm : if a branch is taken, is the new PC = PC + imm or new PC = PC + 4 + imm?
(A) The answer is PC = PC + offset. Please be careful when converting imm to offset.
(Q) Do we need to worry about whether we should prevent all writes to the zero register and treat it as always zero, or is that solely up to us depending on our design?
(A) This is purely the software’s job. The hardware doesn’t have to check whether x0 is writable, nor does it have to explicitly insert 0.
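The branch answer above (PC = PC + offset, with care when converting imm to offset) comes down to reassembling the scattered B-type immediate bits and sign-extending them. A minimal sketch in Python, assuming the standard RV32I B-type layout (imm[12|10:5] in bits 31:25 and imm[4:1|11] in bits 11:7); the function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def b_type_offset(inst):
    """Extract the signed byte offset from a 32-bit B-type instruction."""
    imm12 = (inst >> 31) & 0x1     # inst[31]    -> imm[12]
    imm10_5 = (inst >> 25) & 0x3F  # inst[30:25] -> imm[10:5]
    imm4_1 = (inst >> 8) & 0xF     # inst[11:8]  -> imm[4:1]
    imm11 = (inst >> 7) & 0x1      # inst[7]     -> imm[11]
    imm = (imm12 << 12) | (imm11 << 11) | (imm10_5 << 5) | (imm4_1 << 1)
    return sign_extend(imm, 13)    # bit 0 is always 0, so 13 bits in total


def branch_target(pc, inst):
    """Taken-branch target: the branch's own PC plus the offset (not PC+4)."""
    return (pc + b_type_offset(inst)) & 0xFFFFFFFF


# beq x0, x0, +8 encodes as 0x00000463; beq x0, x0, -4 encodes as 0xFE000EE3.
print(b_type_offset(0x00000463), b_type_offset(0xFE000EE3))
```

The offset is relative to the address of the branch itself, so a backward branch of -4 at PC 0x100 lands on 0xFC.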
(Q) Is the immediate field inside assembly code decimal?
(A) If the number starts with 0x, it’s hexadecimal.
(Q) What does assign inst_FE = imem[PC_FE_latch[`IMEMADDRBITS-1:`IMEMWORDBITS]]; mean?
(A) PC_FE_latch contains the PC value. Again, imem and dmem are word addressable, so we don’t need the 2 least significant bits. Since imem and dmem have only 2^14 entries, we just use addr[15:2] to index imem/dmem.
(Q) How do I know what the correct instruction/code behavior is?
(A) You can use RISC-V emulators or another RISC-V machine to execute the code. One example is here.
(Q) My code does not load any instructions. How do I fix that?
(A) Carefully check your error messages and make sure you have set IDMEMINITFILE to the right path.

FAQ for part 2
(Q) I’m not sure how to understand the part 2 test code.
(A) The tests in test/part2 are modified code from the RISC-V test suite. They use macro functions to generate test code.
(Q) What are the li instructions in add.dump?
(A) The li instruction is one of the pseudo instructions. For small immediates it is the same as addi rd, x0, imm.
(Q) I passed test[1-5].mem. Why do I fail addi.mem?
(A) It contains bne, auipc, and jal instructions as well. So in order to pass the part 2 test cases, you need to complete those instructions.
(Q) I’d like to use the RISC-V emulator for testing the test code, but it won’t take a dump file. What should I do?
(A) You can use test/dumptoasm.py to extract the assembly code.
(Q) Behavior of lui. The documentation says that – Semantics : R[rd] = imm


[SOLVED] Cs3220 lab #3 (10 pts)

10 pts in total, rescaled to 11.25% of your final course score.
Part 0: Env Setup: 0 pts
Part 1: Deploy on Pynq-Jupyter: 10 pts
Submission deadline: Oct 9th

This lab serves as a continuation of Lab #2. The primary aim is to guide you through the process of deploying your RISC-V processor on a Pynq board.

Learning Outcomes:
1. Learn to create and use the AXI lite protocol to communicate with your RISC-V processor.
2. Get familiar with the Vivado and Vitis HLS toolchain.

Part-0: Env Setup (0 pts)
Accessing Pynq Board
We’ve set up remote access to the pynq board. Please follow the instructions outlined in this document. Setting up the environment might take some time. Kindly be patient.
Remote Desktop Upload & Download Files
You can access the root directory of your remote machine through your browser. Refer to step 4 of the above document and click on the “data root directory” at the bottom of the screen. The remote desktop and the pynq board share the same data storage, so there is no need to transfer data between them.

Part-1: Deployment on Pynq-Jupyter (10pts)
In this part, you will deploy your RISC-V processor on a pynq board. The pynq board provides a field programmable gate array, which allows you to program its hardware. You will be able to communicate with the board through the AXI lite protocol from a Jupyter notebook.

Step-1: Vitis for Creating a Communication Adapter
In this step, you will generate a communication adapter using comm.cpp in Vitis HLS. This allows you to communicate with the Verilog modules (your RISC-V processor). The code in comm.cpp only defines ports (input and output arguments) to the Verilog modules, with a memory-mapped connection using the AXI lite protocol. So you can consider this Vitis code a communication adapter; Vitis generates most of the necessary logic for us.

Step-2: Vivado for Bitstream Generation
[1] Prepare your code: Modify pipeline.v to have two additional ports and change reset to reset_n. You will interact with it from the CPU side (Jupyter Notebook) through these ports.
Before:
    module pipeline ( input wire clk, input wire reset );
After:
    module pipeline( input clk, input reset_n, output[31:0] out1, output[31:0] out2 );
    wire reset = ~reset_n;
In pipeline.v, connect out1 for cycle_count:
    always @ (posedge clk) begin if (reset) begin cycle_count
Export bitstream.
[10] Prepare the files
Copy the following files:
+ [proj_name].runs/impl_1/design_1_wrapper.bit or wherever you exported the bitstream in step [9].
+ [proj_name].runs/impl_1/design_1_wrapper.tcl
+ [proj_name].gen/sources_1/bd/design_1/design_1.hwh.
Make sure you rename all the files to have the same name (e.g. riscv.bit, riscv.tcl, riscv.hwh)

Step-3: Deploy on the Pynq Board
[11] Upload the files
Place all the files generated in the step above and the riscv_test.ipynb file in the same folder.
[12] Running on the Pynq Board
Open the riscv_test.ipynb file in the requested Jupyter notebook and run the code. The 0x20 address corresponds to out1 and the 0x30 address corresponds to out2. The out1 value will keep changing since it’s a cycle count, and the out2 value will be the constant you put in at the beginning of step 2. Include the screenshot of the ipynb in your report.

Submission Guideline
What to submit
A zip file containing your .bit, .tcl, .hwh files;
A screenshot of the jupyter notebook, showing the expected value in step [12].
Grading policy
If the bitstream you submit can be successfully deployed on the pynq board and the screenshot in step [12] is correct, you will receive full credit.


[SOLVED] Cs3220 lab #2 : branch prediction

100 pts in total, will be rescaled into 11.25% of your final score of the course. Part 1: Baseline Branch Predictor: 60 pts Part 2: Performance Measurement & Optimization: 40 pts + 10 bonus pts Submission ddl: Oct 2nd Part 1: Baseline Branch Predictor (60 points): In this part, you’ll be implementing a baseline branch predictor and a branch target buffer for your RISC-V CPU. Here’s a concise overview of the design: 1. Its branch history register (BHR) has a length of 8 bits, you will use PC[9:2] XOR BHR to index a Pattern History Table (PHT), which is composed of 2^8 2-bit counters for branch prediction. Each counter is initialized with 1 (indicating a weakly not taken). 2. The branch target buffer (BTB) has 16 entries, and you will use PC[5:2] to index it. Summary of the G-share branch prediction algorithm: FE Stage (fe_stage.v): Both BTB and PHT are concurrently accessed in this stage. 1. If there’s a BTB hit, use PHT outcome to determine the target address for the next fetch: if the outcome is taken, use BTB target address. If BTB misses, use PC+4 for next instruction. 2. The index (PC[9:2] XOR BHR) used in FE stage is passed to EX stage for PHT update. EX stage (agex_stage.v): 1. If the predicted address is incorrect, flush the pipeline. 2. For branch instructions (bne, beq, jalr, etc.), insert the target address into the BTB, whether taken or not. 3. If PHT is used for branching prediction in the FE stage, update PHT using the propagated PHT index (PC[9:2] XOR BHR). 4. Update the BHR. To pass this part and earn full credit, implement the baseline branch predictor described above and run your baseline branch predictor on testall.mem and ensure it passes this testcase. Part 2: Performance Measurement & Optimization (40 points + 10 bonus pts) 1. [10 pts] For this part, you will evaluate branch prediction accuracy by adding counters to measure it (# of correctly predicted branches / # total branch instructions). 
Utilize the towers.mem testcase for this assessment and write your measurement results in a pdf report. 2. [30 pts + 10 pts bonus] Enhance the performance of your branch predictor on the towers.mem testcase by making design changes: you can explore other BHR hashing functions (e.g. using different bits of PC for the XOR operation), or change the PHT or BTB sizes. Implement at least three different design changes, and present the corresponding performance outcomes in your report. If your modifications result in more than a 5% increase in prediction accuracy compared to the baseline branch predictor, you will earn 10 bonus points. Submission Provide a zip file containing your source code for Part 1. Generate the submission.zip file using the command make submit. Avoid manual zip file creation to prevent any issues with the autograding script, which could lead to a 30% score deduction. Submit a concise PDF report for Part 2 (limited to 2 pages) containing the following information: Your performance measurements for the baseline G-share branch predictor and your three variants. Discuss the design parameters that were modified and explain how these changes influenced branch prediction accuracy, either positively or negatively. FAQ [Q] I passed testall.mem but failed to pass some testcases under test/part2. What should I do? [A] Please carefully check whether your when-toflush logic is correctly implemented in the AGEX stage based on the following criteria: When should we flush the pipeline? If the branch is not taken, and next instruction we fetched is not PC+4, we should flush the pipeline; if the branch is taken, and the next instruction we fetched is not the branch target, we are supposed flush the pipeline as well. [Q] I’m debugging my code. I see that there is an X in the BTB. How would it be possible? [A] FE stage can have pipeline bubbles. BTB/BHT might be indexed with uninitialized values. 
Please also make sure that when you update the BTB/BHT, only branch instructions/signals (not including X) can change the BTB/BHT values.
[Q] I don’t see performance improvement in testall.mem. Why?
[A] Every branch in testall.mem is executed only once and is not taken. For a branch predictor to work, the processor has to see the same branch over and over. Without training, the branch predictor won’t work well.
[Q] Do we insert a BTB entry only for a taken branch, or also for a not-taken branch?
[A] You insert a BTB entry even for a not-taken branch, because the same branch might be taken the next time it is predicted.
[Q] If we insert a not-taken branch for the BTB entry, what will be the target address?
[A] You can compute the potential target address and insert it in the BTB.
[Q] What if the target in the BTB is wrong?
[A] Just like a branch misprediction, we flush the pipeline and also update the BTB with the correct information.
[Q] With a branch predictor, will the pipeline still have pipeline bubbles?
[A] The pipeline will have bubbles for dependency stalls but not for correctly predicted branch instructions.
[Q] I want to add a new file (bp.v). Can I?
[A] Please do not add new files, as they might break our auto-grading script.
[Q] Do I have to show the performance improvement in order to get full credit for part 1?
[A] No. The performance improvement needs to be demonstrated in part 2 only.
[Q] Are we expected to implement data forwarding in lab 2?
[A] No.
[Q] Let’s say my instruction stream is as follows: BR(1) ADD BR(2). When BR(1) is in EX, it will update the BHR. But BR(2) will be in FE at that time. Which value of BHR should FE use? The old value or the updated value from EX?
[A] This is one of the optimization opportunities, so how you handle this case is up to you. Please remember that the branch predictor is just a predictor and it won’t affect the correctness of the program.
[Q] How do I initialize the PHT to one?
[A] You should explicitly write 1s into the PHT on reset.
[Q] I ran towers.mem and my test case failed, unlike the other test cases. Is that expected?
[A] Yes. towers.mem returns “255”, which does not match the PASS criteria of the simulator. You do not need to worry about it.
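
The Part 1 description (8-bit BHR, PHT index PC[9:2] XOR BHR, 2^8 two-bit counters reset to 1, 16-entry BTB indexed by PC[5:2]) can be checked against a small software model before writing any Verilog. The sketch below is a hypothetical C++ model, not the lab's fe_stage.v or agex_stage.v. The names GsharePredictor, pht_index, predict_next_pc and update are illustrative, and several choices are assumptions the handout does not spell out: the BTB tag is the full PC, the BHR shifts left with the newest outcome in bit 0, and the PHT is trained on every resolved branch rather than only on BTB hits.

```cpp
#include <array>
#include <cstdint>

// Hypothetical software model of the baseline predictor (illustrative names).
struct GsharePredictor {
    struct BtbEntry {
        bool valid;
        uint32_t tag;     // assumption: full PC used as the tag
        uint32_t target;
    };

    uint8_t bhr = 0;                   // 8-bit branch history register
    std::array<uint8_t, 256> pht;      // 2^8 two-bit saturating counters
    std::array<BtbEntry, 16> btb{};    // 16 entries, all invalid after reset

    // Reset: every counter starts at 1 (weakly not taken).
    GsharePredictor() { pht.fill(1); }

    // PHT index = PC[9:2] XOR BHR.
    uint8_t pht_index(uint32_t pc) const {
        return static_cast<uint8_t>(((pc >> 2) & 0xFF) ^ bhr);
    }

    // FE stage: use the BTB target only on a BTB hit with a "taken" counter,
    // otherwise fall through to PC+4.
    uint32_t predict_next_pc(uint32_t pc) const {
        const BtbEntry& e = btb[(pc >> 2) & 0xF];   // BTB index = PC[5:2]
        const bool hit = e.valid && e.tag == pc;
        const bool taken = pht[pht_index(pc)] >= 2; // 2 or 3 means taken
        return (hit && taken) ? e.target : pc + 4;
    }

    // EX stage: train with the resolved branch. fe_index is the PHT index
    // that was computed in FE and carried down the pipeline.
    void update(uint32_t pc, uint8_t fe_index, bool taken, uint32_t target) {
        uint8_t& counter = pht[fe_index];
        if (taken) {
            if (counter < 3) ++counter;
        } else {
            if (counter > 0) --counter;
        }
        // Insert the target whether the branch was taken or not.
        btb[(pc >> 2) & 0xF] = BtbEntry{true, pc, target};
        // Assumption: shift left, newest outcome in bit 0.
        bhr = static_cast<uint8_t>((bhr << 1) | (taken ? 1 : 0));
    }
};
```

Under these assumptions a loop branch that is always taken needs several iterations before it is predicted taken, because each new history value selects a fresh counter that still holds 1. This is the training effect the FAQ refers to when explaining why testall.mem shows no improvement.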


[SOLVED] Cs3220 lab #1 : pipeline design

100 pts in total, which will be rescaled to 11.25% of your final score for the course.
Part 1: 50 pts, submission deadline: Sep 11th
Part 2: 50 pts, submission deadline: Sep 18th
Part 3 (Optional): 20 bonus pts, submission deadline: Sep 18th

Description: In this assignment, you will create a 5-stage RISC-V pipelined processor using Verilog, focusing on a subset of the RISC-V ISA. We will be using the Tiny RISC-V version from Cornell, which is provided in the Tiny RISC-V ISA file. In part 0, you will familiarize yourself with the essential software tools required for the experiments on the PACE cluster. In part 1, you only need to implement the addi, add, and beq instructions to pass all 5 test cases in test/part1/test[1-5].mem. In part 2, you will expand your processor by adding more instructions to pass the test cases under test/part2/. Part 3 is optional for bonus pts, where you will complete the RISC-V processor.

Part 0: Experiment Setup
Please follow the instructions provided to run experiments on the PACE cluster. What to submit: No submission is required for Part 0. However, ensure that you can independently use GTKWave to visualize waveforms effectively.

Part 1: Minimal functionality
In this part, you’ll implement a subset of RISC-V instructions and aim to pass 5 tests in the test/part1 directory. Refer to the test cases and the README file in test/part1 for detailed requirements.
1. [20pts] Complete the agex_stage.v file. No modifications to other files are necessary. Your implementation should pass test/part1/test[1-5].mem. If not all test cases pass, you’ll receive partial credit. To test all cases together, run run_tests.sh part1, and it will produce part1_results.log and part1_tests.log for you. You can also run each test case independently; see the FAQ for part 1. Note: If you encounter latch size errors, modify the corresponding latch size definition in define.vh.
2. [10pts] Explain the actions in each pipeline stage while executing test/part1/test1.mem.
Include waveform screenshots illustrating relevant signals. For example, in the Execute stage (EX stage), you should visualize the input (regval1_AGEX, regval2_AGEX) and output (aluout_AGEX) signals of the ALU, and the opcode (op_I_AGEX).
3. [10pts] Explain how your RISC-V processor resolves Read-After-Write hazards in test/part1/test2.mem. Include waveform screenshots illustrating the discussed signals.
4. [10pts] Explain how your RISC-V processor handles branch misprediction in test/part1/test4.mem. Include waveform screenshots illustrating relevant signals. Note: In Lab 1, branches are always predicted as not-taken; in Lab 2, you will implement your own branch predictor.
What to submit: Submit the following to Canvas: Include a PDF file containing your explanations and corresponding screenshots. Start part 2 as early as possible and do not wait until the last week, as it involves a heavier workload than part 1.

Part 2: Expanding instruction set
Test cases: In part 2, all instructions in the test cases under test/part2/, such as add, addi, auipc, beq, bge (all branch instructions), jal, and jalr, will be tested. To test all test cases together, use run_tests.sh part2, which will generate part2_results.log and part2_tests.log. Tests [7-9] are handwritten assembly code, which are easier to debug, so start with those. In part 2, we start to use modified RISC-V test cases.
*.S is assembly code that uses RISC-V macros. Macros are defined in include/test_macros.h or include/riscv_test.h. It also uses ABI names and pseudo instructions. You can find a summary of information [here].
*.dump is a dump file output from the gcc RISC-V compiler.
*.mem file has the format for Verilog code.
*.dec file is useful when using the [RISC-V emulator].
What to submit: Submit the following to Canvas: Avoid procrastination; start early to manage the workload effectively.

Part 3 (Optional) Complete the processor 1.
[20pts] In this part, you will complete the processor to fully support the RISC-V ISA (except CSR instructions). Your goal is to ensure your program passes all the test cases in the test/part3/ directory. To receive full credit, your program must pass test/part3/testall.mem. Partial scores will be awarded based on the coverage of the Part 3 test suites. What to submit: Submit the following to Canvas:

Useful Information
References:
summary of RISC-V Assembly coding
RISC-V emulator (tiny RV2)
Verilator manual
GTKWave manual
Tutorial about RISC-V TEST SUITE

FAQ for part 1
(Q) How do I run a specific test file?
(A) Please see “define.vh”: you need to change line 21 to change which test file to read: `define IDMEMINITFILE “/home/zhifan/workspace/cs3220-23fall/lab1/test/part1/test4.mem”. You need to change “test4.mem” into
(Q) Debugging takes so much time. Any tips to reduce the debugging time?
(A) Some suggestions: 1. Review the code carefully and understand the ISA behavior correctly. 2. If the make command fails to compile, read the error messages carefully. 3. The make command generates a vcd file. Please use GTKWave to see important signals and check whether the signals work as expected according to the *.asm files or RISC-V emulators. When debugging, it is always helpful to visualize the clk signal and pc values along with other important signals.
(Q) How do I know whether my implementation is correct or not?
(A) If you run make, you will see a “Pass” message.
(Q) Can I add new files?
(A) Yes, but please make sure they are added to the zip file.
(Q) Do we need to implement a branch predictor?
(A) It’s not required for lab 1.
(Q) Do we need to create a stack for nested JAL instructions?
(A) The hardware does not know about nested calls, so you do not need to implement it.
(Q) BEQ t1, t1, imm: if a branch is taken, is the new PC = PC + imm or new PC = PC + 4 + imm?
(A) The answer is PC = PC + offset. Please be careful when converting imm to offset.
(Q) Do we need to worry about whether we should prevent all writes to the zero register and treat it as always zero, or is that solely up to us, depending on our design?
(A) This is purely a S/W job. The H/W doesn’t have to check whether x0 is writable or not. The hardware also doesn’t have to explicitly insert 0.
(Q) Is the immediate field inside assembly code decimal?
(A) If the number starts with 0x, it’s hexadecimal.
imem[PC_FE_latch[`IMEMADDRBITS-1:`IMEMWORDBITS]]; dmem[memaddr_MEM[`DMEMADDRBITS-1:`DMEMWORDBITS]];
(Q) What does assign inst_FE = imem[PC_FE_latch[`IMEMADDRBITS-1:`IMEMWORDBITS]]; mean?
(A) PC_FE_latch contains the PC value. Again, imem and dmem are word addressable, so we don’t need the 2 least significant bits. Since imem and dmem have only 2^14 entries, we just use addr[15:2] to index imem/dmem.
(Q) I’m not sure how to understand the part 2 test code.
(A) The tests in test/part2 are modified code from the RISC-V test suite. They use macro functions to generate test code.
(Q) How do I know what is the correct instruction/code behavior?
(A) You can use RISC-V emulators or another RISC-V machine to execute the code. One example is here.
(Q) How do I know whether I pass the code or not?
(A) For part 1, we provide test code. Your code should print out a “Pass” message if you run make.
(Q) My code does not load any instructions. Do I need to change anything?
(A) Carefully check whether you encountered any error messages and make sure you have set IDMEMINITFILE to the right path.

FAQ for part 2
(Q) What is the li instruction in add.dump?
(A) The li instruction is one of the pseudo instructions. For an immediate that fits in 12 bits, li rd, imm is the same as addi rd, x0, imm.
(Q) I passed test[1-5].mem. Why do I fail addi.mem?
(A) It contains bne, auipc, and jal instructions. So in order to pass the part 2 test cases, you need to complete those instructions.
(Q) I’d like to use the RISC-V emulator for testing the test code, but it won’t take a dump file. What should I do?
(A) Unfortunately, the RISC-V emulator only takes assembly instructions.
Hence, we recommend using another emulator. You can use the *.dec file in this simulator. (Q) Behavior of lui. The documentation says that – Semantics : R[rd] = imm


[SOLVED] Cs3210 – lab 1

Processes, Threads and Synchronization Basics

Learning Outcomes
1. Understand the differences between processes and threads
2. Use the POSIX thread (pthread) library for shared-memory parallel programming
3. Implement critical sections in the code
4. Apply basic synchronization constructs in programs
5. Start to become familiar with our lab machines

You can obtain 2% of your grade in CS3210 by submitting your work at the end of the lab.

Why Learn fork(), pthreads, etc?
fork() / pthreads are relatively lower-level ways to create and synchronize processes and threads. However, it’s important to understand these intricacies before we explore the more abstracted and powerful libraries such as OpenMP / MPI.

Programming Language: C vs C++
If you know C++, please do not use C++’s own std::thread, condvar/semaphore/mutex/unique lock, etc. in CS3210 unless specifically allowed. Please use pthreads.

Logging in & Getting Started
For the lab and assignments, you are going to be running your code on the machines in the Parallel and Distributed Computing Lab located in COM1-B1-02. Use the following instructions to connect to the lab machines remotely over ssh. Please follow https://nus-cs3210.github.io/student-guide/accessing/. For this lab, connect to one of the machines using the guide above, and start working on completing the tasks in the lab. The lab files can be found here: https://www.comp.nus.edu.sg/~srirams/cs3210/L1_code.zip. You can use the command “wget” to download the code to the lab machine, and “unzip” to unzip the file.

Part 1: Processes vs. Threads

Multi-process programming on Linux with C++
Let us look at a simple program which demonstrates the use of processes in Linux. Open the ex1-processes.cpp file and study the use of the fork() system call and its return values. Note the wait(nullptr) call by the parent process. The purpose of this call is to make sure the parent process waits until all its child processes are completed.
In a situation where the child continues to run after the parent process is completed (died), the child is called an orphan process.
• Compile the code in a terminal (console): > g++ -o processes ex1-processes.cpp
• Run the program in a terminal: > ./processes

Exercise 1
Compile and run ex1-processes.cpp. Observe the output. Why is the line “We just cloned a process..!” printed twice? Fix the code such that the line only prints once.

Creating and terminating threads

    for (size_t i = 0; i < NUM_THREADS; i++)
    {
        printf("main thread: creating thread %zu\n", i);

        // pthread_create spawns a new thread and returns 0 on success
        rc = pthread_create(&threads[i], NULL, work, (void *)i);
    }

Listing 1: Snippet of ex2-threads.cpp

ex2-threads.cpp contains a simple example of creating (spawning) threads with the pthread library and terminating them. In ex2-threads.cpp, the loop runs NUM_THREADS times and calls the pthread_create function to create/spawn new threads. pthread_create takes four arguments:
1. thread – Reference to a thread variable of type pthread_t (element in the threads array in this example)
2. attr – Thread attributes
3. start_routine – The function to be executed by the newly spawned thread (function work in this example)
4. arg – The argument passed to start_routine (the loop index i, cast to void *, in this example)

• To find out more about different C++ functions, you can use the man (manual) command in the terminal (console): > man pthread_create
• Compile the code in a terminal: > g++ -pthread -o threads ex2-threads.cpp
• Run the program in a terminal: > ./threads

Exercise 2
Compile ex2-threads.cpp and run the program. Observe the output. Modify the NUM_THREADS value and observe the order of thread execution. Do threads execute in the same order they are spawned each time the program runs? Is the final value of the variable counter always the same? Explain.
Part 2: Process and Thread Synchronization

A critical section is a section of code that uses mutual exclusion to ensure that:
• Only one thread at a time can execute in the critical section
• All other threads have to wait on entry
• When a thread leaves a critical section, another can enter

A race condition happens when two concurrent threads (or processes) access a shared resource without any synchronization. Race conditions arise in software when an application depends on the sequence or timing of processes or threads for it to operate correctly.

Process Synchronization with Semaphores
• Compile the code in a terminal: > g++ -pthread -o semaph semaph_named.cpp
• Run the program in a terminal: > ./semaph

Pitfalls: Named vs Unnamed Semaphores
Notice that we did not explicitly share our semaphore (sem) between parent and child processes. sem is shared correctly across all our processes because we used named semaphores through the POSIX sem_open library call. This automatically causes sem to be in a shared memory region. If we used unnamed semaphores through the POSIX sem_init library call, we would have to allocate the semaphore within shared memory ourselves. See man sem_overview. Read semaph_shm.cpp to see the changes required for unnamed semaphores.

Thread Synchronization with Mutexes and Condition Variables
• Compile the code in a terminal: > g++ -pthread -o race ex3456-race-condition.cpp
• Run the program in a terminal: > ./race

Exercise 3
Compile ex3456-race-condition.cpp and run the program. Observe the output.

pthread_join is a pthread library function which guarantees to the calling thread that the target thread has terminated. In the program ex3456-race-condition.cpp, if the main thread calls pthread_join for all the ADD and SUB threads before printing the final result of the global variable, we should see the real final value after all ADD and SUB threads are completed.
    int pthread_join(pthread_t thread, void **retval);
    pthread_join(thread, NULL); // example

Exercise 4
Modify ex3456-race-condition.cpp (new name: ex4-race-condition.cpp) to ensure that all ADD threads and SUB threads complete before printing the final result. Compile, run, and observe the output. (Run multiple times to see if the output is consistent.)

Mutexes
A mutex is a synchronization construct which is used to control access to a critical section in the code. A mutex variable acts like a lock, and the thread that acquires the mutex gets to access the critical section. Once a thread has acquired a mutex lock to a critical section, no other thread can acquire it until the first thread releases the mutex.

pthread mutex example

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&lock);
    // critical section here
    pthread_mutex_unlock(&lock);

Exercise 5
Modify ex3456-race-condition.cpp (new name: ex5-race-condition.cpp) by adding a mutex variable to control access to the global counter. Compile, run, and observe the output. (Run multiple times to observe if the output is consistent!) What do you think are the differences between a pthread mutex and a binary semaphore?

Condition variables
Mutexes provide a mechanism for controlling access to a critical section to prevent races. However, they cannot be used for threads to wait until another thread completes some arbitrary task. Condition variables provide a mechanism for threads to be signaled by other threads rather than continuously polling to check if a certain condition has been met. Condition variables are used in association with mutex variables. Related pthread functions are:
• Create and destroy: pthread_cond_init(condition,attr), pthread_cond_destroy(condition)
• Waiting and signaling: pthread_cond_wait(condition,mutex), pthread_cond_signal(condition), pthread_cond_broadcast(condition)

Download and study ex6-cond-example.cpp which demonstrates the use of condition variables.
The main thread creates three threads. Two of those threads increment a “count” variable, while the third thread watches the value of “count”. When “count” reaches a predefined limit, the waiting thread is signaled by one of the incrementing threads. The waiting thread “awakens” and then modifies count. The program continues until the incrementing threads reach TCOUNT. The main program prints the final value of count.

Exercise 6
Modify ex3456-race-condition.cpp (new name: ex6-race-condition.cpp) using condition variables to prevent SUB threads from executing until all ADD threads are completed.

Further reading and examples: https://computing.llnl.gov/tutorials/pthreads/

Part 3: Producer-Consumer Problem (to be submitted)
In this part, we combine the first two parts to solve the producer-consumer problem using both (i) processes and semaphores, and (ii) threads, mutexes and condition variables.

Exercise 7
Implementing the same producer-consumer logic with processes involves allocating memory from the kernel space as a means of maintaining a global variable (for inter-process communication). Refer to the example which uses shared memory with processes in semaph_named.cpp.

Exercise 8
Implement the exercise above but using processes and semaphores only (i.e., no pthreads, condition variables, etc). Name your program ex8-prod-con-processes.cpp. The very basic approach of your program should be as follows:

    // allocate shared memory
    // allocate semaphores
    if (fork() == 0) producer(); // producer 1
    if (fork() == 0) producer(); // producer 2
    consumer();
    // cleanup shared memory

Exercise 9
Limit the total number of items produced/consumed to a sufficiently large fixed value (to observe the performance of the programs accurately) and measure the time taken to complete the program for both cases (processes and pthreads). Then, vary this limit on the total number of items produced.
Comment on the observations for your threads and processes implementations in exercises 7 and 8 (maximum length: 1 paragraph).

Pitfalls: Correctly exiting multi-threaded / multi-process programs
• The signal function (man 2 signal), and what code can run safely in a signal handler.
• The pthread_sigmask function (man 3 pthread_sigmask).
• How to indicate to running processes that they should exit.
• How to ensure processes do not deadlock when trying to exit.

Lab sheet (2% of your final grade):
• Your code for the producer and consumer functions in ex7-prod-con-threads.cpp and ex8-prod-con-processes.cpp.
• Your answer for exercise 9.
Please use a legible monospace font (e.g. 11-point Consolas) with single line spacing for your code. Your answer for exercise 9 should also be in a legible font (no smaller than 11-point Arial).

Appendix: Debugging

Viewing Processes and Threads
To view the running processes and threads in a Linux console, we can use the ps and top/htop commands. These commands should be invoked separately in a different terminal window. To see details of the processes running on your system, run any of the following commands in a terminal:
• > ps -ef
• > ps -A
• > top
• > htop
If too much information is printed to read at one time, you can pipe the output through the less command to scroll through it at your own pace: > ps -A | less
If you are looking for a specific process, e.g., bash, you can do > ps -A | grep bash
More information on ps: http://man7.org/linux/man-pages/man1/ps.1.html or type man ps in the console.
To list individual threads under each process: > top -H
More information on top: http://man7.org/linux/man-pages/man1/top.1.html or type man top in the console.
To kill a running process use either one of these commands:
• > kill -p
• > pkill
• > killall

Debugging C / C++ Programs
There are multiple debugging tools available for debugging C programs.
The gdb debugger is a command line debugger for C (and many other languages). To use the gdb debugger, we need to compile the source code with the -g compiler flag. (When you compile with -g, the compiler includes debugging information in the binary, making it easier for gdb to find bugs.) gdb provides debugging features such as breakpoints, step execution, and examining the call stack.
> g++ -g -o prog prog.cpp
> gdb prog
• Run the program inside gdb: > run
• Official gdb documentation: https://ftp.gnu.org/old-gnu/Manuals/gdb/html_node/gdb_toc.html
Valgrind is a more advanced profiler which helps us debug applications as well as detect performance issues. It includes advanced features such as detecting race conditions and false sharing.
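
The mutex and pthread_join steps from Exercises 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of ex3456-race-condition.cpp: the names Shared, Task, work and run_add_sub are hypothetical, and the demo assumes each ADD thread adds 1 and each SUB thread subtracts 1 the same number of times, so the expected final counter is 0.

```cpp
#include <cstddef>
#include <pthread.h>
#include <vector>

// Hypothetical shared state: one counter protected by one mutex.
struct Shared {
    long counter;
    pthread_mutex_t lock;
};

// Hypothetical per-thread argument: which counter to touch and how.
struct Task {
    Shared* shared;
    long delta;       // +1 for an ADD thread, -1 for a SUB thread
    int iterations;
};

// Thread body: each update happens inside the critical section.
void* work(void* arg) {
    Task* task = static_cast<Task*>(arg);
    for (int i = 0; i < task->iterations; ++i) {
        pthread_mutex_lock(&task->shared->lock);
        task->shared->counter += task->delta;   // critical section
        pthread_mutex_unlock(&task->shared->lock);
    }
    return nullptr;
}

// Spawns `pairs` ADD threads and `pairs` SUB threads, joins all of them,
// and only then reads the counter.
long run_add_sub(int pairs, int iterations) {
    Shared shared;
    shared.counter = 0;
    pthread_mutex_init(&shared.lock, nullptr);

    // Build every Task first so the pointers handed to threads stay valid.
    std::vector<Task> tasks;
    for (int i = 0; i < pairs; ++i) {
        tasks.push_back(Task{&shared, 1, iterations});
        tasks.push_back(Task{&shared, -1, iterations});
    }

    std::vector<pthread_t> threads(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        pthread_create(&threads[i], nullptr, work, &tasks[i]);
    }
    for (std::size_t i = 0; i < threads.size(); ++i) {
        pthread_join(threads[i], nullptr);   // wait for every thread
    }

    pthread_mutex_destroy(&shared.lock);
    return shared.counter;
}
```

Calling run_add_sub(4, 10000) from main and printing the result should give 0 on every run when compiled with g++ -pthread. Removing the lock/unlock pair is a quick way to reproduce the inconsistent totals that Exercise 3 asks you to observe.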


[SOLVED] Cs220

Homework 5: Machine Language

Objective: Build the two Assembly Language programs described below, which will test your understanding of Assembly programs for our HACK architecture. We highly recommend that you go through the simulator tutorial for assembly programs before attempting the homework.

Grading method:

Div.asm: In this program, you will implement the division operation by successively subtracting the divisor from the dividend to reach a quotient. Write an assembly program to perform the division of two integers (stored at R0 and R1) and store the result in R2. You can perform division through successive subtraction. For example: 16 – 3 – 3 – 3 – 3 – 3 = 1, so 16 / 3 = 5 (remainder 1).
a. First write Java code using a loop to perform the successive subtraction. Assume the following: int R0 = 16; int R1 = 3; int R2 = 0; // R2 holds the result of R0 / R1 (after using successive subtraction in a loop) // You do not have to submit the Java code, it is meant to help you translate into assembly.
b. Next, convert the high-level Java code into assembly code (named div.asm). This is the code you will copy/paste to the Word document you submit. Please ensure *every* section of Assembly code is documented with a comment.

CS220.asm: Here are two screen shots indicating the expected output:
a) Regular Challenge
b) Extra-Credit Challenge w/ Centering (+ 20 points extra credit)

What do you turn in? Create one Word document (or PDF) with the following in order:
1. The Div.asm source code (make sure to comment every section of code)
2. A screen shot (entire window) containing the output from running the Div.asm code with R0 initialized to 16 and R1 initialized to 3. The correct result of 5 should be shown in R2. Be sure to comment your code and end the program with an infinite loop to prevent NOOP slides.
3. The CS220.asm source code (make sure to comment every section)
4.
A screen shot containing the output from running the CS220.asm code in the CPUEmulator (take screen shot of entire window).

Program        Working?   Well built?   Documentation?
Div.asm        20         15            5
CS220.asm      40         15            5
Extra Credit   20
Subtotal       60         30            10
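
The successive-subtraction loop that Div.asm is meant to translate can be sketched in a high-level language first. The handout suggests Java for this step; the version below is written in C++ instead, and the loop body carries over to Java unchanged. The function name divide_by_subtraction and the -1 return for invalid input are assumptions made for illustration, since the handout only covers a non-negative dividend and a positive divisor.

```cpp
// Integer division by repeated subtraction: r0 / r1, quotient returned.
// Assumption: returns -1 for inputs the handout does not cover
// (negative dividend, or divisor <= 0, which would otherwise loop forever).
int divide_by_subtraction(int r0, int r1) {
    if (r0 < 0 || r1 <= 0) {
        return -1;
    }
    int r2 = 0;            // R2 accumulates the quotient
    while (r0 >= r1) {     // stop once the remainder is smaller than the divisor
        r0 -= r1;          // subtract the divisor once more
        ++r2;              // count the subtraction
    }
    return r2;             // r0 now holds the remainder
}
```

With R0 = 16 and R1 = 3 the loop subtracts five times and leaves a remainder of 1, which matches the expected value of 5 in R2. In Hack assembly the loop condition becomes a jump on the sign of R0 - R1.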
