Submitty provides a number of utilities for analyzing student code through the assignment autograding configuration interface. Many simple use cases can be addressed with `submitty_count`, which allows an instructor to count occurrences of a variety of syntactic features within student code.

To use `submitty_count`, invoke it as a command within the `config.json` file for a given assignment, supplying the type of feature to count, the feature itself, any number of source files, and optional configuration flags. For example:

```
submitty_count --language python call print *.py
```
Note: `submitty_count` is an alias for a program installed on the Submitty server. You can run the underlying program directly to see how it works. Here is the same example:

```
/usr/local/submitty/SubmittyAnalysisTools/count --language python call print *.py
```

This example outputs the number of calls to the function `print` in all of the Python source files in the current directory. Another example:

```
submitty_count -l c token Goto main.cpp
```

This second example outputs the number of occurrences of the `Goto` token in the C/C++ source file `main.cpp`.
The `comment_count` tool in Submitty counts the number of comments in the student code. Example usage:

```
comment_count *.py
```

It is also possible to provide a list of files written in different programming languages:

```
comment_count *.py *.cpp
```
Here are a couple of sample configurations:

- Tutorial Example: 04 Python Static Analysis
- Tutorial Example: 05 C++ Static Analysis
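For orientation, here is a minimal sketch of how a count might be wired into an autograding testcase. The testcase and validation fields shown here (`title`, `command`, `points`, and an `intComparison` check against `STDOUT.txt`) are assumptions modeled on the tutorial examples above, so consult those configurations for the authoritative schema:

```json
{
    "testcases": [
        {
            "title": "Count calls to print",
            "command": "submitty_count --language python call print *.py",
            "points": 2,
            "validation": [
                {
                    "method": "intComparison",
                    "comparison": "ge",
                    "term1": 1,
                    "actual_file": "STDOUT.txt"
                }
            ]
        }
    ]
}
```

The idea is that `submitty_count` prints a single integer to standard output, and the validation step compares that integer against a threshold; a `le` comparison could instead be used to penalize code that, say, calls `print` too often.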
## Countable Features
Currently, three feature types can be counted: tokens, nodes, and function calls. The countable features contained in a given file can be identified using `submitty_diagnostics`, for example as follows:

```
/usr/local/submitty/SubmittyAnalysisTools/diagnostics -l python file.py
```

This tool outputs JSON data by default. An interactive view of the data can be produced by specifying HTML format:

```
/usr/local/submitty/SubmittyAnalysisTools/diagnostics -l python --format html file.py
```

For example, if you would like to count additions but are unsure which token to count, you could use a test file like:

```python
# file.py
print(1 + 1)
```

Running `/usr/local/submitty/SubmittyAnalysisTools/diagnostics -l python file.py` on this file will produce the following output:
```
{
    "/absolute/path/to/file.py": {
        "tokens": [
            {
                "end_col": 6,
                "token": "Identifier",
                "start_line": 2,
                "start_col": 1,
                "end_line": 2
            },
            {
                "end_col": 7,
                "token": "LeftParen",
                "start_line": 2,
                "start_col": 6,
                "end_line": 2
            },
            {
                "end_col": 8,
                "token": "IntegerLiteral",
                "start_line": 2,
                "start_col": 7,
                "end_line": 2
            },
            {
                "end_col": 10,
                "token": "Plus",
                "start_line": 2,
                "start_col": 9,
                "end_line": 2
            },
            {
                "end_col": 12,
                "token": "IntegerLiteral",
                "start_line": 2,
                "start_col": 11,
                "end_line": 2
            },
            {
                "end_col": 13,
                "token": "RightParen",
                "start_line": 2,
                "start_col": 12,
                "end_line": 2
            }
        ],
        "nodes": { ... node data here ... }
    }
}
```
The `token` fields specify tokens that can be given to `submitty_count`. Notice that a `Plus` token is present between two `IntegerLiteral` tokens. You can verify that this is the right token by looking at the `start_line`, `end_line`, `start_col`, and `end_col` fields, which indicate the row and column at which each token begins and ends within the file. Here, the `Plus` token starts on line 2 at column 9, which is exactly where the `+` appears in `print(1 + 1)`.
Once you are sure that the token is correct, you can count it within student submissions with `submitty_count`:

```
submitty_count -l python token Plus *.py
```
### Tokens
A token is a representation of a syntactic feature as a member of a set of categories. Within Submitty, almost all other data is discarded except for this category, which sidesteps many of the difficult parts of source code analysis. For example, imagine a scenario in which an instructor wants to count the number of uses of `goto` in a C program. Take the following example of student code:
```c
/* Assignment 1: Don't use goto! */
#include <stdio.h>

int main() {
    int foo = 1;
    printf("I'm not using goto ");
}
```
The use (or lack thereof) of `goto` could certainly be detected by, say, a simple regular-expression-based search, but it would be difficult to handle the cases where `goto` appears inside a comment or a string. Contrast this with the token-based search approach. The previous code fragment tokenizes to the following:

```
Int Identifier LeftParen RightParen LeftCurly
Int Identifier Equals IntegerLiteral Semicolon
Identifier LeftParen StringLiteral RightParen Semicolon
RightCurly
```
In this representation, it is very easy to determine that `goto` is not being used. Contrast this with the following:

```c
int main() {
foo:
    goto foo;
}
```

This would tokenize to:

```
Int Identifier LeftParen RightParen LeftCurly
Identifier Colon Goto Identifier Semicolon
RightCurly
```
Here, the use of `goto` is immediately apparent from the presence of the `Goto` token.
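To make the pitfall concrete, here is a small hypothetical sketch (plain Python, not part of Submitty) showing how a naive text search over the first code fragment reports two uses of `goto` even though the program never uses the statement:

```python
import re

student_code = '''\
/* Assignment 1: Don't use goto! */
#include <stdio.h>

int main() {
    int foo = 1;
    printf("I'm not using goto ");
}
'''

# A plain text search finds two matches, both false positives:
# one inside the comment and one inside the string literal.
print(len(re.findall(r"\bgoto\b", student_code)))   # prints 2
```

A token-based count (the `token Goto` example shown earlier) reports neither occurrence, because comment and string contents never become a `Goto` token.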
Counting tokens handles many common automatic grading scenarios, and should be the first tool considered when writing an assignment that requires static analysis. Only seek out more advanced options when necessary.
### Nodes
The next level of analysis enables counting nodes within a parse tree, which is a translation of the textual source into a tree structure. Within Submitty, each node in the parse tree is assigned some number of textual tags. For example, this code fragment:

```python
while True:
    1 + 1
```

parses to the following tree:

```
Node(Tag "while", Tag "loop")
├── Node(Tag "literal", Tag "boolean")
│   └── DataBool True
└── Node(Tag "plus", Tag "add", Tag "+")
    ├── Node(Tag "literal", Tag "integer")
    │   └── DataInt 1
    └── Node(Tag "literal", Tag "integer")
        └── DataInt 1
```
Notice here that in addition to the hierarchical structure of the nodes, there is also a generally hierarchical structure to the tags: boolean and integer literals both share the “literal” tag, but both also have a more specific tag denoting what kind of literal is present. This enables the counting of specific classes of node. For example:
```
submitty_count -l python node literal *.py
```

If run on the code fragment at the start of this section, this will return `3`, counting all of the literals used within the code. In contrast:

```
submitty_count -l python node integer *.py
```

will return `2`, as it only counts the integer literals.
Distinctions of this kind are not possible with token counting, which only cares about the actual textual form of a token. Node counting can also be used to differentiate between different uses of the same token. For example, in Python the `For` token is used for both the `for` loop and the list comprehension. Since the same `For` token is present regardless of which of these features is used, it is not possible to distinguish them using a token counting approach. However, these features have different nodes in the parse tree, so by counting nodes with certain tags it is possible to easily distinguish them.
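As a concrete illustration, consider a hypothetical test file in which both statements contribute a `For` token, so a token count cannot tell them apart, while the parse tree gives each a distinct node. The exact tag names to count are best confirmed by running the diagnostics tool on a file like this one:

```python
# for_test.py (hypothetical): both statements below contain a For token.
for x in range(3):                     # an ordinary for loop
    print(x)

squares = [x * x for x in range(3)]    # a list comprehension
```

Running the diagnostics tool on this file reveals which node tags distinguish the loop from the comprehension, and those tags can then be passed to `submitty_count`.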
### Function Calls
This method is a bit higher-level: it attempts, via a language-dependent method, to detect a call to a function with a particular name. It is more easily “tricked” than the other methods, especially in languages with first-class functions like Python, but it is still a useful tool. A common use of this method at RPI is determining the number of calls to the `print` function present in Python code, for example:

```
submitty_count call print -l py *.py
```
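To illustrate why call counting is easier to trick, consider this hypothetical Python fragment. Only the first statement is a call spelled `print(...)`; the aliased and dynamically looked-up calls below it may not be attributed to `print` by a name-based detector:

```python
import builtins

print("a direct call")            # a call to print by name

p = print                         # alias the function...
p("an aliased call")              # ...and call it under another name

getattr(builtins, "print")("a dynamically looked-up call")
```

None of this is specific to Submitty; it is simply the kind of indirection that any name-based call detector has to contend with.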
See also the Visualization Tool for Abstract Syntax Trees.