KEDIT: Language Definition Files

KEDIT's syntax coloring facility uses different colors to highlight comments, strings, keywords, and other items in programs that you are editing. The rules that KEDIT uses to determine which text to treat as part of a comment, a string, a keyword, etc., are specified in special files, called KLD (``KEDIT Language Definition'') files, that are described in this chapter.

Loading KLD Files

KEDIT Language Definition files are loaded via the SET PARSER command:

[Set] PARSER parser fileid

Use the parser operand to specify the name of the parser you want to define.

The fileid operand specifies a file, with a default extension of .KLD, containing your language definition. KEDIT searches for the .KLD file in the same directories it uses when searching for macro files, as controlled by SET MACROPATH.

For example, if you were working with a hypothetical language called LANG and you had described the language in a KEDIT Language Definition file called LANGDEF.KLD, you could define a parser called LANG with the command

SET PARSER LANG LANGDEF.KLD

After issuing the SET PARSER command, you could then issue the command

SET COLORING ON LANG

to use this parser to control syntax coloring for the current file.

If files in your language always had an extension of, for example, .LNG, you could use the SET AUTOCOLOR command to tell KEDIT to always use the LANG parser for .LNG files:

SET AUTOCOLOR .LNG LANG

SET PARSER commands are typically executed from your KEDIT profile when KEDIT is initially loaded. For example:

* if this is the first profile execution in a session,
* set up the LANG parser and then
* cause all .LNG files to be colored using the LANG parser
if initial() then do
    'set parser lang langdef.kld'
    'set autocolor .lng lang'
    end

Several language definitions are built into KEDIT, and when KEDIT is loaded it automatically issues SET PARSER commands that use these language definitions to set up its default parsers. See the description of the SET PARSER command for a complete list of built-in parsers. To distinguish these internal language definition files from actual disk files, KEDIT uses an asterisk as the first character of their names. For example, the command

SET PARSER C *C.KLD

tells KEDIT to use *C.KLD as the Language Definition File associated with the C parser. The asterisk in the name tells KEDIT to use the special file *C.KLD, which is built into KEDIT, and not to look for the file on disk.

Copies of all of the KLD files built into KEDIT are included in the SAMPLES subdirectory of the main KEDITW directory. For example, there is a C.KLD file that is an exact copy of the *C.KLD file that is built into KEDIT. If you modify one of these copies you should save it in a different location (normally the USER subdirectory of the main KEDITW directory) and load it by issuing a SET PARSER command referring to the modified file.

Note that whenever you issue the SET PARSER command, the KLD file that you specify is loaded into memory, even if an identical SET PARSER command has previously been issued. This makes it easy to develop and test modifications to KLD files, because if you make changes to a KLD file you can simply reissue the appropriate SET PARSER command and KEDIT will load the updated version of the file. Any files whose syntax coloring is controlled by your parser will automatically be re-colored, so you can easily see the effect of the changes you have made to the KLD file.

KLD File Format

Here is a description of the format of KEDIT Language Definition files, which usually have an extension of .KLD. The best way to get started with KLD files is to look over this description briefly, and then to examine some of the KLD files that are included in the SAMPLES directory of the main KEDITW directory.

The rules given here for KLD files are flexible enough to describe a number of popular programming languages, to handle varying syntax conventions for comments, strings, numbers, etc., and to have user-configurable lists of keywords. The goal is to handle many common language variants with a relatively small number of parameters.

KLD files are divided into sections. Each section begins with a section header, consisting of a colon in column one followed immediately by the section name. Following each section header line are one or more lines of parameter information.

To improve readability, you can insert blank lines at any point in a KLD file. Additionally, any line whose first nonblank character is an asterisk (``*'') is considered a comment line and is ignored by KEDIT. For example:

* Sample KLD contents
:case
 ignore

:identifier
 [a-z] [a-z0-9]

:keyword
 if
 then
 else

The above example starts with a comment line, followed by a :CASE section with one parameter line, an :IDENTIFIER section with one parameter line, and a :KEYWORD section with three parameter lines. Parameter information is usually indented from column one, as in this example, but it does not have to be.

Here are descriptions of each kind of KLD file section:

:CASE section

The :CASE section consists of a single line with the word RESPECT or the word IGNORE. RESPECT means that the language you are describing is case-sensitive (for example, ``else'' and ``ELSE'' are not considered identical), and IGNORE means that the language is case-insensitive.

An example:

:CASE
 respect

If the :CASE section is omitted, KEDIT assumes case insensitivity. If present, the :CASE section must precede the :IDENTIFIER section.

:OPTION section

The :OPTION section consists of a single line containing special options that are needed to properly process some languages. There are currently two possible options:

PREPROCESSOR char

PREPROCESSOR indicates that the language supports a C-like preprocessor mechanism, and that preprocessor keywords are preceded by the specified character. For example:

:OPTION
 preprocessor #

REXX

REXX indicates that the REXX language is being described. In REXX, certain identifiers are sometimes considered keywords and are sometimes considered variables, depending on the context in which they are used, and the REXX option tells KEDIT to do the special processing that this requires.

If the :OPTION section is omitted, KEDIT does not do special handling of preprocessor keywords or of REXX keywords.
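
For a REXX-like language, the section would presumably hold just the option name. This is a minimal sketch based on the option description above; it is not copied from the REXX.KLD file shipped with KEDIT:

```
:OPTION
 rexx
```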

:IDENTIFIER section

The :IDENTIFIER section consists of a single line that specifies what characters can appear within identifiers in the language you are describing. These characters are specified in the same way as character class specifications within KEDIT regular expressions. They consist of lists, enclosed in square brackets, of valid characters and/or ranges of valid characters (each range written as the first character in the range, a minus sign, and the last character in the range). For example,

:IDENTIFIER
 [a-zA-Z]

specifies that any sequence of alphabetic characters is a valid identifier.

In many languages, there are different rules for what is valid as the first character of an identifier and for what is valid in additional characters in an identifier. To handle this situation, you can include two identifier specifications: first specify what is valid as the first identifier character and then specify what is valid in the remaining characters. For example, in C programs the first character of an identifier can be any alphabetic character or can be an underscore, while the remaining characters of an identifier can be alphabetic or can be underscores, but can also be numeric digits:

:IDENTIFIER
 [a-zA-Z_]    [a-zA-Z0-9_]

In some cases (BASIC programs are the main example), the last character of an identifier can be a special character that is not valid elsewhere in an identifier. For example, in BASIC, ABC@ is a valid identifier. To handle this, you can include a third item specifying the special characters acceptable only at the end of an identifier. For example:

:IDENTIFIER
 [a-zA-Z]  [a-zA-Z0-9_]  [%&!#@$]

The :IDENTIFIER section is required if you will be using the :KEYWORD section to give a list of the keywords in your language. The :IDENTIFIER section must appear before the :KEYWORD section.

:COMMENT section

Use the :COMMENT section to describe the rules for comments in your language. Each line of the :COMMENT section describes one type of comment; since some languages have multiple methods for specifying comments, there may be multiple lines in the :COMMENT section.

Some languages have single-line comments, which are introduced by some type of comment delimiter and cannot continue for multiple lines. Some languages have comments with both a starting and an ending delimiter. This kind of comment can usually continue for multiple lines, but in some languages may be restricted to a single line.

For example, C++ allows comments that are introduced by a pair of slashes (``//'') and continue until the end of the line. C++ also allows comments that can continue for multiple lines, introduced by a slash-asterisk pair (``/*'') and terminated by an asterisk-slash pair (``*/''). The corresponding :COMMENT section would be:

:COMMENT
 line     //     any
 paired   /*  */ nonest

Line comments are described using the format

LINE delim ANY|FIRSTNONBLANK|COLUMN n

where delim is the comment delimiter, which is followed by an indication of when the comment delimiter takes effect:

ANY

indicates that appearance of the comment delimiter anywhere on a line (except within a quoted string) starts a comment.

FIRSTNONBLANK

indicates that the comment delimiter starts a comment only if it is the first nonblank item on a line.

COLUMN n

indicates that the comment delimiter starts a comment only if it appears in column n of a line.
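
The FIRSTNONBLANK and COLUMN forms can be sketched as follows. Both fragments are illustrations built from the format above and are not taken from any KLD file shipped with KEDIT. The first mirrors the comment convention of KLD files themselves; the second assumes a COBOL-style language in which an asterisk in column 7 marks a comment line:

```
* comments start with an asterisk as the first nonblank character
:COMMENT
 line  *  firstnonblank

* alternative: comments are marked by an asterisk in column 7
:COMMENT
 line  *  column 7
```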

Comments with both starting and ending delimiters are described using the format

PAIRED delim1 delim2 [NEST|NONEST] [MULTIPLE|SINGLE]

where delim1 is the delimiter that starts a comment and delim2 is the delimiter that ends a comment.

NEST|NONEST

NEST indicates that multi-line comments can be nested inside multi-line comments, with the comments ending only when as many comment end delimiters as comment start delimiters have been encountered. NONEST is the default and indicates that comments cannot be nested, and that a comment ends as soon as the next comment end delimiter has been encountered. For example, consider

/*
/* here is a comment */
x = 17
*/

In the REXX language, which allows nested comments, ``x = 17'' would be considered part of a comment. In the C language, which does not allow nested comments, ``x = 17'' would not be considered part of a comment, and the final ``*/'' in the example would be invalid.

MULTIPLE

indicates that the comments can continue for multiple lines; this is the default and need not be specified.

SINGLE

indicates that, even though paired delimiters are being used, the comments must begin and end on a single line.
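
As a sketch of how these operands combine, the first fragment below describes REXX-style nested comments, and the second describes a hypothetical language whose ``(*'' and ``*)'' comments must end on the line where they begin. Both are illustrations based on the format above, not excerpts from KEDIT's built-in definitions:

```
* nested, multi-line comments (MULTIPLE is the default)
:COMMENT
 paired  /*  */  nest

* hypothetical: non-nested comments restricted to one line
:COMMENT
 paired  (*  *)  nonest  single
```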

:HEADER section

The :HEADER section describes header lines. Header lines are used to indicate the start of a new section in certain types of files; the section headers in .KLD files are examples of header lines.

Header lines are specified in the same way as single-line comments:

LINE delim ANY|FIRSTNONBLANK|COLUMN n

As far as KEDIT's syntax coloring is concerned, the only difference between single-line comments and headers is that comments are displayed using ECOLOR A and headers are displayed using ECOLOR G. An example of a :HEADER section that describes .KLD file section headers:

:HEADER
 line : column 1

:STRING section

Use the :STRING section to describe the types of quoted strings used in your language. Each line of the :STRING section describes one type of string; since some languages have multiple methods for specifying strings, there may be multiple lines in the :STRING section. There are three possibilities:

SINGLE

This means that your language uses strings enclosed in single quotes.

DOUBLE

This means that your language uses strings enclosed in double quotes.

DELIMITER c

Use this to specify that the character c is the string delimiter for your language.

SINGLE, DOUBLE, and DELIMITER c can optionally be followed by the word BACKSLASH, the word MULTILINE, or both. BACKSLASH indicates that, as is the case in the C language, the backslash character serves as an escape character within strings and that quote characters following a backslash do not terminate a string. MULTILINE indicates that strings need not begin and end on the same line, but can continue across end-of-line boundaries.

If the :STRING section is omitted, KEDIT's syntax coloring does not recognize any strings in your files.
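
A minimal sketch of a :STRING section follows. The first fragment assumes a C-like language in which both kinds of quoted strings honor backslash escapes; the second assumes a hypothetical language that delimits strings with a tilde and lets them span lines. Neither is copied from a shipped KLD file:

```
* C-like strings with backslash escapes
:STRING
 single  backslash
 double  backslash

* hypothetical: tilde-delimited strings that can span lines
:STRING
 delimiter ~  multiline
```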

:NUMBER section

Use the :NUMBER section to indicate the format of numbers in your language. The :NUMBER section consists of a single line containing one of the words INTEGER, DECIMAL, C, COBOL, PASCAL, REXX, or ADA.

INTEGER means that numbers consist of strings of digits.
DECIMAL means that numbers consist of strings of digits and periods.
C is used for C language numbers. These can be integers, decimal numbers, or numbers in exponential notation, like 12.4E-2. Several other languages use numeric formats that are similar to those used by C.
COBOL is used for COBOL language numbers, which consist of digits and decimal points, except that trailing decimal points are not counted as part of a number, and digits immediately followed by COBOL identifier characters (for example, 1234-TEST) are not counted as numeric.
PASCAL numbers are like C language numbers, except that they cannot start with a decimal point. Also, hexadecimal values (for example, $abcd) are treated as numeric.
REXX handles REXX language constant symbols, which include REXX numbers and symbols like 12ABC and .XYZ.
ADA numbers are like C language numbers, except that underscores are allowed within the numbers.

If the :NUMBER section is omitted, KEDIT's syntax coloring does not recognize any numbers in your files.
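
For instance, a language whose numeric literals follow C conventions could presumably be described with the following section (a sketch based on the list of formats above):

```
:NUMBER
 c
```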

:LABEL section

Use the :LABEL section to define what counts as a label in your language. The label section normally consists of a single line, but can involve multiple lines if your language has multiple ways of specifying labels. The label description has the format

DELIMITER delim FIRSTNONBLANK|ANY|COLUMN n

where delim is the delimiter that must follow the label. FIRSTNONBLANK indicates that the label must be the first nonblank item on a line, ANY indicates that the label can appear anywhere on a line, and ``COLUMN n'' indicates that the label must begin in column n of a line.

Instead of a DELIMITER line, you can specify

COLUMN n

to indicate that any non-keyword identifier beginning in the specified column should be treated as a label, with no need for a delimiter following the label.
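
Both forms are sketched below. The first assumes a language in which a label is an identifier followed by a colon at the start of a line; the second assumes a column-oriented language in which any non-keyword identifier starting in column 8 is a label. The delimiter and the column number are illustrative choices, not values taken from KEDIT's built-in parsers:

```
* labels such as "retry:" at the start of a line
:LABEL
 delimiter  :  firstnonblank

* alternative: any non-keyword identifier starting in column 8
:LABEL
 column 8
```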

:MATCH section

Use the :MATCH section to specify the matching characters and identifiers that indicate nested structure within your language. For example, in most languages, left and right parentheses can be nested and must match up properly in a syntactically correct program. In some languages the same is true of keywords like BEGIN and END.

KEDIT's syntax coloring facility uses the information in the :MATCH section for two purposes:

First, items at different nesting levels are colored differently, so you can easily see which items match. For example, in the line

if (f(x + y + z) = 17)

KEDIT can display the inner parentheses and the outer parentheses in different colors.

Second, when you use the CMATCH command (assigned by default to Shift+F3) to find the matching item for the text at the cursor position, KEDIT can properly match any items described in the :MATCH section. With the cursor on the first DO in the following example, Shift+F3 can move the cursor to the second END in the example:

if a = 5 then do
  j = 17
  do i = 1 to 10
   say i*j
  end
end

Each line of the :MATCH section has either two or three items. The first item specifies the identifiers or character sequences that introduce a matchable construct. The second item specifies the identifiers or character sequences that end a matchable construct. The third item is optional, and is used to specify items that always appear inside of a matchable construct.

For example,

:MATCH
  (   )
  {   }
  #if  #endif  #else

Here, three matchable constructs are specified:

The first specifies that left parentheses will be matched with corresponding right parentheses.
The second specifies that left braces will be matched with corresponding right braces.
The third specifies that, as in the C preprocessor language, #if is matched with #endif and that within an #if/#endif construct there may be an #else item that should be colored in the same way as the corresponding #if and #endif.

KEDIT actually uses the following :MATCH section in its default C language parser:
```
:MATCH
 (     )
 {     }
 #ifdef,#if,#ifndef   #endif   #else,#elif,#elseif
```
This is because any of #ifdef, #if, and #ifndef can match up with #endif, with any of #else, #elif, and #elseif allowed between them. As in this example, you can specify multiple equivalent items in a :MATCH section, separated by commas.

Some notes on using the :MATCH section:

The current scheme for handling matched items works only for items that do not contain blanks. That is, #if and #endif pairs or BEGIN and END pairs can be matched, but WHILE and END WHILE or IF and END IF cannot be handled, since they contain blanks.

An identifier or character sequence should only appear once in the :MATCH section; any additional occurrences of the same item will have no effect. So, for example, in a language that has DO -- END and BEGIN -- END constructs, you should not use
```
:MATCH
 DO END
 BEGIN END
```
but should instead use
```
:MATCH
 DO,BEGIN END
```
Any identifiers included in :MATCH specifications must also appear in the :KEYWORD section, or they will be ignored.

If the :MATCH section is omitted, KEDIT's syntax coloring facility does not recognize any matchable constructs in your files.

:KEYWORD section

Use the :KEYWORD section to specify the keywords in your language. Each line of the :KEYWORD section has the form

keyword [ALTERNATE n] [TYPE m]

where keyword must be a valid identifier in your language. (If you specified PREPROCESSOR in the :OPTION section, you can also include preprocessor keywords, which must consist of the preprocessor character followed by a valid identifier.)

Keywords are normally colored according to the current ECOLOR D setting, and preprocessor keywords according to the current ECOLOR F setting. It is sometimes useful to specify different types of keywords that will be colored differently. To do this, you can specify

ALTERNATE n

following a keyword, where n is a number from 1 through 9. When ALTERNATE 1 is specified, ECOLOR 1 is used to color the keyword; when ALTERNATE 2 is specified, ECOLOR 2 is used, etc.

TYPE m

is used only when REXX has been specified in the :OPTION section, and determines what to treat as a REXX keyword, subkeyword, etc. The number m is determined as follows: start with m equal to 0, then add 1 for a REXX keyword, add 2 for a REXX subkeyword, and add 4 for a REXX keyword that takes subkeywords. For example, SAY is a keyword that does not take subkeywords, so it is TYPE 1. ARG is a REXX keyword, is also a REXX subkeyword (as in PARSE ARG), and it takes subkeywords (as in ARG UPPER), so it is TYPE 7. For further examples, see the REXX.KLD file in the SAMPLES subdirectory of the main KEDITW directory.
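
As a worked illustration of this arithmetic, the entries below apply the rule to three instructions. The SAY and ARG values come from the text above. The PARSE value is derived here on the assumption that PARSE is a keyword that takes subkeywords but is never itself a subkeyword, so check REXX.KLD before relying on it:

```
* say:   keyword (1)                                          = 1
* parse: keyword (1) + takes subkeywords (4)                  = 5  (assumed)
* arg:   keyword (1) + subkeyword (2) + takes subkeywords (4) = 7
:KEYWORD
 say    type 1
 parse  type 5
 arg    type 7
```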

A sample :KEYWORD section:

:KEYWORD
  if
  then
  else
  do
  end
  switch
  for
  procedure  alternate 1

If the :KEYWORD section is omitted, KEDIT's syntax coloring facility does not recognize any keywords. If the :KEYWORD section is specified, it must be preceded by the :IDENTIFIER section.
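
The sketch below shows how preprocessor keywords might sit alongside ordinary keywords for a C-like language. The keyword list is a small illustrative subset, the choice of ALTERNATE 1 for typedef is arbitrary, and placing :OPTION ahead of the other sections is an assumption rather than a documented requirement:

```
:OPTION
 preprocessor #

:IDENTIFIER
 [a-zA-Z_]  [a-zA-Z0-9_]

:KEYWORD
 if
 else
 while
 return
 #include
 #define
 typedef  alternate 1
```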

:MARKUP section

The :MARKUP section is used with HTML and similar markup languages. It can contain a TAG line and, optionally, a REFERENCE line.

Use the TAG line to specify the character string that initiates a markup tag and the character string that terminates a markup tag.

In an HTML file, where a typical line of text might be:

<H1>Level 1 header</H1>

``<'' initiates a tag, and ``>'' terminates it. This would be specified in the :MARKUP section as

:MARKUP
 TAG <  >

Use the REFERENCE line to specify the character string that initiates a character or entity reference and the character string that terminates it.

HTML lets you use entity references like ``&lt;'' or character references like ``&#60;'' to refer to special characters. These references begin with an ampersand (``&'') and end with a semi-colon (``;''). This would be specified in the :MARKUP section as:

:MARKUP
 TAG       <  >
 REFERENCE &  ;

The following special rules apply if your KLD file contains a :MARKUP section:

Tags are highlighted, using ECOLOR T. For example, in the line
```
<P>This is a new paragraph.
```
``<P>'' would be highlighted.

Quoted strings within tags are highlighted using ECOLOR B. For example, in
```
<A HREF="film_clip.jpg">
```
the quoted string is displayed using ECOLOR B, while the rest of the tag is displayed using ECOLOR T.

Similarly, numbers within tags are highlighted using ECOLOR C.

Numbers and quoted strings that are not within markup tags are not highlighted.

Character and entity references are highlighted using ECOLOR U.

:COLUMN section

Use the :COLUMN section to specify that the parser should ignore certain columns of your file. For example, in COBOL columns 1 through 6 of a file and all columns beyond column 72 of a file are ignored by the compiler. This would be specified as

:COLUMN
 EXCLUDE  1  6
 EXCLUDE 73  *

Each line of the :COLUMN section has the word EXCLUDE followed by the starting and ending column of a range of columns that the parser is to ignore. The ending column can be given as an asterisk to indicate that all columns through the end of the line are to be ignored.

When the syntax coloring parser processes a line of your file, it will treat the excluded columns as if they were entirely blank. By default, the excluded columns will be displayed with no special highlighting, but you can specify that any of the 9 ALTERNATE colors be used. For example,

:COLUMN
 EXCLUDE 1 10 ALTERNATE 2

would display columns 1 through 10 of your file using ECOLOR 2.

:POSTCOMPARE section

The :POSTCOMPARE section is used to color character sequences that are not handled by any of the other sections of a KLD file. For example, you might want to color operators like ``+'', ``-'', and ``='', or items like ``.T.'' and ``.F.'', which indicate True or False in xBase programs but are not valid identifiers.

The :POSTCOMPARE section can contain CLASS lines and TEXT lines.

CLASS lines specify a set of characters that you want to have colored, using the same regular expression character class notation that is used in the :IDENTIFIER section. For example,

CLASS [+-=/]

means that ``+'', ``-'', ``='', and ``/'' characters are to be colored. KEDIT uses ECOLOR I by default, but you can instead specify any of the nine alternate keyword colors. For example:

CLASS [+-=/] ALTERNATE 2

TEXT lines specify a string of nonblank characters that is to be colored. For example,

TEXT .T.

would color the character sequence ``.T.''. KEDIT uses ECOLOR D by default, but you can specify an alternate keyword color. For example:

TEXT .T. ALTERNATE 3

You can specify any number of CLASS or TEXT lines in a :POSTCOMPARE section. When applying syntax coloring to your file, the :POSTCOMPARE section is processed last. That is, KEDIT first checks for identifiers, numbers, comments, tags, etc., and checks the items in the :POSTCOMPARE section only if none of these are found.
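
Putting the CLASS and TEXT lines from the examples above together, a complete :POSTCOMPARE section for an xBase-like language might look like the sketch below. The ``.F.'' line and the particular ALTERNATE numbers are illustrative choices:

```
:POSTCOMPARE
 class [+-=/]  alternate 2
 text  .T.     alternate 3
 text  .F.     alternate 3
```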

Note that it is not useful to include valid identifiers in the :POSTCOMPARE section. The parser checks for identifiers before :POSTCOMPARE is processed, so identifiers, even those that are not listed in the :KEYWORD section, will never be matched by :POSTCOMPARE. For this reason, any identifiers that you want to color should be included in the :KEYWORD section.
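
To show how the sections fit together, here is a sketch of what the LANGDEF.KLD file for the hypothetical LANG language mentioned earlier might contain. Every syntax choice in it (comment delimiters, string rules, keywords, and so on) is invented for illustration. The file respects the two ordering rules stated in this chapter, :CASE before :IDENTIFIER and :IDENTIFIER before :KEYWORD; the order of the remaining sections is an assumption:

```
* LANGDEF.KLD: sketch of a definition for the hypothetical LANG language
:CASE
 ignore

:IDENTIFIER
 [a-zA-Z_]  [a-zA-Z0-9_]

:COMMENT
 line    //      any
 paired  /*  */  nonest

:STRING
 double  backslash

:NUMBER
 c

:LABEL
 delimiter  :  firstnonblank

:MATCH
 (      )
 begin  end

:KEYWORD
 begin
 end
 if
 then
 else
 procedure  alternate 1

:POSTCOMPARE
 class [=/]
```

Note that BEGIN and END appear in both the :MATCH and :KEYWORD sections, since identifiers used in :MATCH are ignored unless they are also keywords.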