Building a Token Scanner and CSV Parser in Java: ECS140A Project 1 Guide
Learn how to implement a PeekableCharacterStream, Scanner, and CSVParser in Java for ECS140A Project 1. This tutorial covers tokenization rules, CSV parsing, and practical coding tips.
Introduction to ECS140A Project 1
In this tutorial, we'll walk through the core concepts of building a programming language scanner and a CSV parser in Java, as required in ECS140A Project 1. This project is the foundation for subsequent assignments, so understanding the tokenizer and CSV parser is crucial. We'll focus on design and implementation strategies without giving away the complete solution. By the end, you'll be able to create a PeekableCharacterStream, a Scanner that produces tokens, and a CSVParser that maps each CSV row to its column headers.
Understanding the PeekableCharacterStream Interface
The first task is to implement the PeekableCharacterStream interface for a FileInputStream. This interface provides methods to peek ahead without consuming characters, which is essential for tokenization. Think of it like browsing a playlist on a music streaming app: you can see the next song without playing it, or skip ahead to preview tracks. Similarly, your stream lets the scanner look ahead to determine token boundaries.
Key methods include:
- moreAvailable(): checks if more characters exist.
- peekNextChar(): returns the next character without consuming it.
- peekAheadChar(int ahead): returns the character at a given offset.
- getNextChar(): consumes and returns the next character.
- close(): closes the stream.
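Your implementation can wrap the FileInputStream in an InputStreamReader and a BufferedReader to read the file efficiently. Store a small buffer of peeked characters to support lookahead: peeking fills the buffer, and getNextChar() drains it before reading more from the file. This is similar to how a video streaming service buffers a few seconds ahead for smooth playback.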
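The buffering idea can be sketched as follows. This is a minimal sketch, not the assignment's reference solution. It assumes the interface methods return int with -1 at end of stream, and that peekAheadChar(0) refers to the same character as peekNextChar(). The class name BufferedPeekStream is hypothetical, and it reads from a generic Reader so it can be exercised without a file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Assumed shape of the course interface: int results, -1 at end of stream.
interface PeekableCharacterStream {
    boolean moreAvailable();
    int peekNextChar();
    int peekAheadChar(int ahead);
    int getNextChar();
    void close();
}

// Hypothetical class name; reads from any Reader so it can be tested in memory.
class BufferedPeekStream implements PeekableCharacterStream {
    private final Reader reader;
    // Characters already read from the reader but not yet consumed.
    private final List<Integer> lookahead = new ArrayList<>();

    BufferedPeekStream(Reader reader) {
        this.reader = reader;
    }

    // Reads until the buffer holds `count` entries or ends with -1.
    private void fill(int count) {
        try {
            while (lookahead.size() < count) {
                boolean atEnd = !lookahead.isEmpty()
                        && lookahead.get(lookahead.size() - 1) == -1;
                if (atEnd) {
                    return;
                }
                lookahead.add(reader.read());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean moreAvailable() {
        return peekNextChar() != -1;
    }

    @Override
    public int peekNextChar() {
        return peekAheadChar(0);
    }

    // Assumption: offset 0 is the next unread character.
    @Override
    public int peekAheadChar(int ahead) {
        if (ahead < 0) {
            throw new IllegalArgumentException("ahead must be >= 0");
        }
        fill(ahead + 1);
        return ahead < lookahead.size() ? lookahead.get(ahead) : -1;
    }

    @Override
    public int getNextChar() {
        fill(1);
        int c = lookahead.get(0);
        if (c != -1) {
            lookahead.remove(0); // consume; the -1 marker stays in place
        }
        return c;
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the assignment, the Reader would typically be built from the file, for example new BufferedReader(new InputStreamReader(new FileInputStream(name))). Keeping the -1 marker in the buffer means repeated peeks at end of stream never touch the underlying reader again.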
Building the Scanner: Tokenizing Input
The Scanner class takes a PeekableCharacterStream and a list of keywords. It produces tokens using peekNextToken() and getNextToken(). The token rules define identifiers, operators, integer and float constants, string constants, and whitespace skipping. A key nuance: a negative sign before a constant becomes an operator if the previous token was a constant or identifier. For example, in A -5, the minus is an operator, not part of the number.
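One way to express the negative-sign rule is a small decision helper that the scanner consults when it sees a '-'. The sketch below is illustrative only: the TokenType enum, the MinusRule class, and the method name are hypothetical, and it assumes that a numeric constant begins with a digit and that string constants count as constants for this rule.

```java
// Hypothetical token categories; the assignment may name them differently.
enum TokenType {
    IDENTIFIER, KEYWORD, OPERATOR, INT_CONSTANT, FLOAT_CONSTANT, STRING_CONSTANT, INVALID
}

class MinusRule {
    // Returns true when a '-' should be read as the sign of a numeric constant.
    // `previous` is the type of the last token produced, or null at the start
    // of input. `charAfterMinus` is the character following the '-', or -1.
    static boolean minusStartsConstant(TokenType previous, int charAfterMinus) {
        // Assumption: a constant begins with a digit.
        boolean digitFollows = charAfterMinus >= '0' && charAfterMinus <= '9';
        if (!digitFollows) {
            return false; // nothing numeric to attach to, so '-' is an operator
        }
        if (previous == null) {
            return true; // first token in the input
        }
        switch (previous) {
            case IDENTIFIER:
            case INT_CONSTANT:
            case FLOAT_CONSTANT:
            case STRING_CONSTANT: // assumption: strings count as constants here
                return false; // e.g. "A -5": the minus is an operator
            default:
                return true; // e.g. "= -5": the minus belongs to the number
        }
    }
}
```

With this helper, A -5 yields an operator followed by the constant 5, while = -5 yields the single constant -5. The scanner only has to remember the type of the last token it returned.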
To implement the scanner, decide the token type from the first character (peeking one further when needed), then consume characters until the token ends. A state machine approach works well: keep reading while the next character can still belong to the current token. For identifiers, start with a letter or underscore, then consume letters, digits, or underscores. For numbers, handle an optional sign, the digits, and an optional decimal point. For strings, consume until an unescaped double quote.
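The identifier and number rules might look like the sketch below. To stay short it works on a String and a start index rather than on the character stream; a real scanner would make the same decisions with peekNextChar() and getNextChar(). The class and method names are hypothetical, only ASCII letters and digits are accepted, and the float format (digits, a point, then at least one digit) is an assumption rather than the assignment's exact grammar.

```java
class LexemeReader {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static boolean isIdentifierStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    static boolean isIdentifierPart(char c) {
        return isIdentifierStart(c) || isDigit(c);
    }

    // Returns the identifier beginning at `start`, or "" if none begins there.
    static String readIdentifier(String input, int start) {
        if (start >= input.length() || !isIdentifierStart(input.charAt(start))) {
            return "";
        }
        int end = start + 1;
        while (end < input.length() && isIdentifierPart(input.charAt(end))) {
            end++;
        }
        return input.substring(start, end);
    }

    // Returns the numeric lexeme beginning at `start`, or "" if none begins there.
    // Assumed format: optional '-', digits, then optionally '.' and more digits.
    static String readNumber(String input, int start) {
        int end = start;
        if (end < input.length() && input.charAt(end) == '-') {
            end++;
        }
        int digitsStart = end;
        while (end < input.length() && isDigit(input.charAt(end))) {
            end++;
        }
        if (end == digitsStart) {
            return ""; // no digits, so this is not a number
        }
        // Only take the '.' when a digit follows it.
        boolean hasFraction = end + 1 < input.length()
                && input.charAt(end) == '.'
                && isDigit(input.charAt(end + 1));
        if (hasFraction) {
            end++; // the decimal point
            while (end < input.length() && isDigit(input.charAt(end))) {
                end++;
            }
        }
        return input.substring(start, end);
    }
}
```

A lexeme containing a decimal point would be classified as a float constant and one without as an integer constant. Whether the sign is passed to readNumber at all depends on the negative-sign rule described earlier.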
Invalid tokens occur when an underscore or letter immediately follows a constant, or when invalid characters appear. For strings, invalid characters are consumed until the closing quote or end of stream. This is like a spell-checker that flags unrecognized words but continues scanning.
Example: Tokenizing a Simple Expression
Consider input count = 10. The scanner should produce: Identifier "count", Operator "=", IntConstant "10". If the input were count -5, the tokens would be: Identifier "count", Operator "-", IntConstant "5". Note that the minus is separate because the previous token was an identifier.
You can test your scanner with a sample file containing various token types. Use the main method to output tokens in a readable format, like Token type: value.
Developing the CSVParser
The CSVParser class parses CSV files using the same PeekableCharacterStream. It returns a Map<String, String> for each row, where the keys are the column headers from the first row. The CSV rules require a header row with no duplicate or empty column names. Rows are terminated by a newline and columns are separated by commas. A column that contains whitespace must be quoted, and a double quote inside a quoted column is escaped by doubling it.
Empty or missing columns map to null. A data row can have fewer columns than the header, but not more. Your parser must handle quoted fields that may contain commas, newlines, or double quotes. This is similar to parsing a leaderboard from a game tournament, where each row has player stats and missing data is null.
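Handling quoted fields usually comes down to a single flag that records whether the parser is currently inside quotes. The sketch below illustrates that idea on an in-memory String; the class and method names are hypothetical, it treats only '\n' as a row terminator, and it returns raw field text (an empty field is returned as an empty string, leaving the conversion to null for a later step).

```java
import java.util.ArrayList;
import java.util.List;

class CsvRecordSplitter {
    // Splits CSV text into records, each a list of raw field strings.
    static List<List<String>> split(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean recordStarted = false; // true once the current row has any input

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            recordStarted = true;
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    field.append(c); // commas and newlines are literal here
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                fields.add(field.toString());
                field.setLength(0);
                records.add(fields);
                fields = new ArrayList<>();
                recordStarted = false;
            } else {
                field.append(c);
            }
        }
        // A final row that is not terminated by a newline still counts.
        if (recordStarted) {
            fields.add(field.toString());
            records.add(fields);
        }
        return records;
    }
}
```

Because the newline check sits in the branch that runs only outside quotes, a quoted field such as "line1\nline2" stays in one record. A stream-based parser can apply the same branches, using peekNextChar() in place of the text.charAt(i + 1) lookahead.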
Implementation Tips for CSVParser
First, read the header row by scanning characters until newline, splitting by commas while respecting quotes. Store the headers in a list. Then, for each subsequent row, read until newline, split similarly, and create a map from headers to values. If a row has fewer values, set missing columns to null. Use a temporary buffer for quoted fields and handle escape sequences.
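Once a row has been split into values, pairing it with the headers is a short loop. The following is a sketch under stated assumptions: RowMapper and its method names are hypothetical, header problems and rows with too many values are reported by throwing IllegalArgumentException (the assignment may specify a different way to report errors), and an empty value is treated the same as a missing one.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RowMapper {
    // Rejects header rows with empty or duplicate column names.
    static void validateHeaders(List<String> headers) {
        Set<String> seen = new HashSet<>();
        for (String header : headers) {
            if (header == null || header.isEmpty()) {
                throw new IllegalArgumentException("empty column name");
            }
            if (!seen.add(header)) {
                throw new IllegalArgumentException("duplicate column: " + header);
            }
        }
    }

    // Builds one row map; missing or empty values become null.
    static Map<String, String> toRow(List<String> headers, List<String> values) {
        if (values.size() > headers.size()) {
            throw new IllegalArgumentException("row has more columns than the header");
        }
        // LinkedHashMap keeps header order and permits null values.
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < headers.size(); i++) {
            String value = i < values.size() ? values.get(i) : null;
            if (value != null && value.isEmpty()) {
                value = null;
            }
            row.put(headers.get(i), value);
        }
        return row;
    }
}
```

LinkedHashMap is used here so that printing a row lists the columns in header order; a plain HashMap would satisfy the Map<String, String> return type equally well but with unspecified ordering.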
Your main method should take a filename, create a PeekableCharacterStream from it, and use the CSVParser to print each row's map. For example, for a CSV with headers "Name","Score", a row "Alice,95" would output {Name=Alice, Score=95}.
Common Pitfalls and How to Avoid Them
One common mistake is not properly handling the negative sign rule. Remember to track the previous token type. Another is failing to skip whitespace correctly, especially between tokens. For CSV, forgetting that quoted fields can contain newlines will break your parser. Also, ensure your PeekableCharacterStream correctly returns -1 at end of stream.
Test edge cases: empty file, file with only header, file with quoted fields containing commas, and files with missing columns. Use the provided Scanner.sh and CSVParser.sh scripts to compare your output with the solution.
Connecting to Real-World Trends
Tokenization and parsing are fundamental to compilers and interpreters. Language models such as GPT also split text input into tokens before processing it, although they use subword schemes rather than the lexical rules of a programming language. In finance, CSV parsers handle stock trade data. In gaming, leaderboards and replay files use similar parsing. By mastering this project, you're building skills used in modern software development, from chatbots to data analysis.
Final Checklist for Submission
- Implement PeekableCharacterStream for FileInputStream.
- Implement Scanner with proper tokenization rules.
- Implement CSVParser with correct CSV format handling.
- Include a main method in both classes for testing.
- Create a Makefile and README.txt.
- Submit a .tgz archive after make clean.
Remember to cite any external code sources in your README and comments. Good luck with your ECS140A project!