R: Add regular expression R parser #544

Merged
merged 4 commits into from
Sep 3, 2015

vhda
Contributor

@vhda vhda commented Aug 27, 2015

@masatake
Member

@vhda, are you interested in solving this at the meta level?
I have a draft version of an optlib->C translator (code generator).
I would like to convert all regex-based built-in parsers to optlib.
In their place I will introduce the optlib->C translator, so every optlib can be embedded into the ctags executable.

The biggest challenge is the Makefile; we have to integrate the translator into the ctags build process.
I cannot find the time to work on this topic, but if you are interested, I would like you to look at the translator code and investigate how to integrate it.

Input is based on your work:

[yamato@x201]~/var/ctags-github/misc% cat r.ctags
# See https://github.com/universal-ctags/ctags/pull/544
--langdef=R
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function[ \t]*\(/\1/f,Functions,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^(]+$/\1/g,GlobalVars,Global Variables/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^(]+$/v,FunctionVariables,Function Variables/
--map-R=+.r
--map-R=+.R

Run the translator, called ctagsc (it should really be named ctagst):

[yamato@x201]~/var/ctags-github/misc% bash ctagsc --options=./r.ctags
/*
 * Generated by ctagsc
 */
#include "general.h"
#include "parse.h"

static void installRRegex (const langType language)
{
    addTagRegex (language, "^[ \\t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \\t]*<-[ \\t]function[ \\t]*\\(", "\\1", "f,Functions,Functions", NULL);
    addTagRegex (language, "^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \\t]*<-[ \\t][^(]+$", "\\1", "g,GlobalVars,Global Variables", NULL);
    addTagRegex (language, "[ \\t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \\t]*<-[ \\t][^(]+$", "v,FunctionVariables,Function Variables", "", NULL);
}
extern parserDefinition* RParser (void)
{
    static const char *const extensions [] = {
        "r",
        "R",
        NULL
    };
    static const char *const patterns [] = {
        NULL
    };
    static const char *const aliases [] = {
        NULL
    };
    parserDefinition* const def = parserNew ("R");
    def->enabled = TRUE;
    def->extensions = extensions;
    def->patterns = patterns;
    def->aliases = aliases;
    def->initialize = installRRegex;
    def->method     = METHOD_NOT_CRAFTED|METHOD_REGEX;
    return def;
}
/* vi:set tabstop=4 shiftwidth=4: */
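
In the generated code above, the third addTagRegex call has the kind spec in the name position and an empty kind. That follows from the third --regex-R line in r.ctags, which has no /\1/ name field. A minimal sketch of a check for this, assuming a hypothetical helper count_delims that is not part of ctagsc; it counts the unescaped "/" delimiters and will miscount a "/" that sits inside a bracket expression:

```bash
# count_delims: hypothetical helper, not part of ctagsc.
# Print the number of unescaped "/" characters in $1.
count_delims()
{
    # Drop every escaped character (a backslash plus the character after it),
    # then delete everything that is not "/" and count what is left.
    printf '%s' "$1" | sed 's/\\.//g' | tr -cd '/' | wc -c | tr -d ' '
}

# Second and third patterns from r.ctags, without the --regex-R= prefix.
count_delims '/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^(]+$/\1/g,GlobalVars,Global Variables/'   # prints 4
count_delims '/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^(]+$/v,FunctionVariables,Function Variables/'   # prints 3
```

A count of 3 is still accepted by the parser, since the kind field is optional, so this is only a hint that a name field may be missing.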

ctagsc itself:

[yamato@x201]~/var/ctags-github/misc% cat ctagsc
#!/bin/bash
#
# ctagsc - ctags options to C language translator
#
# Copyright (C) 2014 Masatake YAMATO
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
#

declare CTAGS_LANG
declare -a CTAGS_EXTENSIONS
declare -a CTAGS_PATTERNS
declare -a CTAGS_ALIASES

declare -a CTAGS_REGEX_REGEX
declare -a CTAGS_REGEX_NAME
declare -a CTAGS_REGEX_KIND
declare -a CTAGS_REGEX_FLAG

declare CTAGS_DISABLE

error()
{
    # Quote the message so that regex characters in it are not glob-expanded.
    echo "[ERROR] $1" 1>&2
    exit 1
}

warn()
{
    echo "[WARN] $1" 1>&2
}

parse_options()
{
    local optlib=$1
    local line
    while read -r line; do
    if [[ -n "$line" ]]; then
        case "$line" in
        (\#*)
            :
            ;;
        (*)
            parse1 "$line"
            ;;
        esac
    fi
    done < "$optlib"
}

parse_cmdline()
{
    while [[ $# -gt 0 ]]; do
        # Quote the argument so that options containing spaces stay in one piece.
        parse1 "$1"
        shift
    done
}

print_help()
{
    local status=$1

    # ...

    return $status
}

parse_regex()
{
    local original=$1
    local rest=$1

    local escape

    local regexp
    local name
    local kind
    local flags
    local tmp

    if ! [[ "${rest:0:1}" = "/" ]]; then
    error "Wrong regexp syntax: $rest"
    fi

    rest=${rest:1}
    while [[ -n "$rest" ]]; do
    case "${rest:0:1}" in
        (/)
        if [[ -n "$escape" ]]; then
            regexp="${regexp}${rest:0:1}"
            escape=
        else
            break
        fi
        ;;
        (\\)
        regexp="${regexp}${rest:0:1}${rest:0:1}"
        if [[ -n "$escape" ]]; then
            escape=
        else
            escape=1
        fi
        ;;
        (*)
        escape=
        regexp="${regexp}${rest:0:1}"
        ;;
    esac
    rest=${rest:1}
    done

    if ! [[ "${rest:0:1}" = "/" ]]; then
    error "Wrong name syntax: $rest"
    fi

    rest=${rest:1}
    while [[ -n "$rest" ]]; do
    case "${rest:0:1}" in
        (/)
        if [[ -n "$escape" ]]; then
            name="${name}${rest:0:1}"
            escape=
        else
            break
        fi
        ;;
        (\\)
        name="${name}${rest:0:1}${rest:0:1}"
        if [[ -n "$escape" ]]; then
            escape=
        else
            escape=1
        fi
        ;;
        (*)
        escape=
        name="${name}${rest:0:1}"
        ;;

    esac
    rest=${rest:1}
    done

    if ! [[ "${rest:0:1}" = "/" ]]; then
    error "Wrong kind/flags syntax: $rest"
    fi

    rest=${rest:1}
    while [[ -n "$rest" ]]; do
    case "${rest:0:1}" in
        (/)
        if [[ -n "$escape" ]]; then
            tmp="${tmp}${rest:0:1}"
            escape=
        else
            break
        fi
        ;;
        (\\)
        tmp="${tmp}${rest:0:1}${rest:0:1}"
        if [[ -n "$escape" ]]; then
            escape=
        else
            escape=1
        fi
        ;;
        (*)
        escape=
        tmp="${tmp}${rest:0:1}"
        ;;

    esac
    rest=${rest:1}
    done


    if [[ -z "${rest}" ]]; then
    flags=$tmp
    else
    kind=$tmp
    fi

    if [[ "${rest:0:1}" = "/" ]]; then
    rest=${rest:1}
    while [[ -n "$rest" ]]; do
        case "${rest:0:1}" in
        (/)
            if [[ -n "$escape" ]]; then
            flags="${flags}${rest:0:1}"
            escape=
            else
            break
            fi
            ;;
        (\\)
            flags="${flags}${rest:0:1}${rest:0:1}"
            if [[ -n "$escape" ]]; then
            escape=
            else
            escape=1
            fi
            ;;
        (*)
            escape=
            flags="${flags}${rest:0:1}"
            ;;

        esac
        rest=${rest:1}
    done

    fi

    #    echo regexp: "${regexp}"
    #    echo name:  "${name}"
    #    echo kind:  "${kind}"
    #    echo flags:  "${flags}"

    CTAGS_REGEX_REGEX+=( "${regexp}" )
    CTAGS_REGEX_NAME+=( "${name}" )
    CTAGS_REGEX_KIND+=( "${kind}" )
    CTAGS_REGEX_FLAG+=( "${flags}" )
}

parse1()
{
    local opt
    local pat
    local ext
    local ali
    local rgx
    local tmp

    case $1 in
    (-h|--help)
        print_help 0
        ;;
    (--options=*)
        opt=${1/--options=/}
        parse_options "$opt"
        ;;
    (--langdef=*)
        if [[ -n "$CTAGS_LANG" ]]; then
        error "LANG is already defined as $CTAGS_LANG: $1"
        fi
        CTAGS_LANG=${1/--langdef=/}
        ;;
    (--languages=-${CTAGS_LANG})
        CTAGS_DISABLE=1
        ;;
    (--languages=-*)
        if [[ -z "${CTAGS_LANG}" ]]; then
        error "Don't use --languages=- option before defining a language: $1"
        else
        error "Only language you do --langdef can be disabled: $1"
        fi
        ;;
    (--languages=*)
        error "--languages can be used only for disable a language you defined with --langdef: $1"
        ;;
    #
    # --map-<LANG>
    #
    (--map-${CTAGS_LANG}=+\(*\))
        pat=${1/--map-${CTAGS_LANG}=+\(/}; pat=${pat%%\)}
        CTAGS_PATTERNS+=( "$pat" )
        ;;
    (--map-${CTAGS_LANG}=+.*)
        ext=${1/--map-${CTAGS_LANG}=+./}
        CTAGS_EXTENSIONS+=( "$ext" )
        ;;
    (--map-${CTAGS_LANG}=*)
        error "Other than the notation for appending(--map-<LANG>=+...) is not supported"
        ;;
    (--map-*=*)
        if [[ -z "${CTAGS_LANG}" ]]; then
        error "Don't use --map-<LANG>= option before defining a language: $1"
        else
        tmp=${1/--map-/}
        tmp=${tmp/=*/}
        error "The language is not defined: ${tmp}"
        fi
        ;;
    #
    # --alias-<LANG>=+*
    #
    (--alias-${CTAGS_LANG}=+*)
        ali=${1/--alias-${CTAGS_LANG}=+/}
        CTAGS_ALIASES+=( "$ali" )
        ;;
    (--alias-${CTAGS_LANG}=*)
        error "Other than the notation for appending(--alias-<LANG>=+...) is not supported"
        ;;
    (--alias-*=*)
        if [[ -z "${CTAGS_LANG}" ]]; then
        error "Don't use --alias-<LANG>= option before defining a language: $1"
        else
        tmp=${1/--alias-/}
        tmp=${tmp/=*/}
        error "The language is not defined: ${tmp}"
        fi
        ;;
    #
    # --regex=
    #
    (--regex-${CTAGS_LANG}=*)
        if [[ -z "${CTAGS_LANG}" ]]; then
        error "Don't use --regex-<LANG>= option before defining a language: $1"
        fi
        rgx=${1/--regex-${CTAGS_LANG}=}
        parse_regex "$rgx"
        ;;
    (--regex-*=*)
        if [[ -z "${CTAGS_LANG}" ]]; then
        error "Don't use --regex-<LANG>= option before defining a language: $1"
        else
        tmp=${1/--regex-/}
        tmp=${tmp/=*/}
        error "The language is not defined: ${tmp}"
        fi
        ;;
    #
    # --corpus
    #
    (--corpus-${CTAGS_LANG}=*)
        if [[ -z "${CTAGS_LANG}" ]]; then
        error "Don't use --corpus-<LANG>= option before defining a language: $1"
        else
        echo "/* [WARN] $1: --corpus-<LANG> is not implemented yet. */"
        fi
        ;;
    #
    # TODO, --xcmd-<LANG>,   --input-encoding-<LANG>=encoding, --kinds-<LANG>=[+|-]kinds, 
    #

    #
    # rest
    #
    (--langmap=*)
        error "Use --map-<LANG> option instead of --langmap=: $1"
        ;;
    (*)
        warn "Unhandled argument: $1"
        ;;
    esac
}

emit()
{
    local count
    local i
    local elt

    local enabled

    local regex
    local name
    local kind
    local flags

    cat <<EOF
/*
 * Generated by ctagsc
 */
#include "general.h"
#include "parse.h"

static void install${CTAGS_LANG}Regex (const langType language)
{
EOF

    count=${#CTAGS_REGEX_REGEX[@]}
    for i in $(seq 1 $count); do
    i=$(( i - 1 ))

    regex=\""${CTAGS_REGEX_REGEX[$i]}"\"
    name=\""${CTAGS_REGEX_NAME[$i]}"\"
    kind=\""${CTAGS_REGEX_KIND[$i]}"\"
    if [[ -z "${CTAGS_REGEX_FLAG[$i]}" ]]; then
        flags=NULL
    else
        flags=\""${CTAGS_REGEX_FLAG[$i]}"\"
    fi

    cat <<EOF
    addTagRegex (language, $regex, $name, $kind, $flags);
EOF

    done
    cat <<EOF
}
EOF

    cat <<EOF
extern parserDefinition* ${CTAGS_LANG}Parser (void)
{
    static const char *const extensions [] = {
EOF
    for elt in "${CTAGS_EXTENSIONS[@]}"; do
    echo "      \"$elt\","
    done
    cat <<EOF
        NULL
    };
    static const char *const patterns [] = {
EOF
    for elt in "${CTAGS_PATTERNS[@]}"; do
    echo "      \"$elt\","
    done
    cat <<EOF
        NULL
    };
    static const char *const aliases [] = {
EOF
    for elt in "${CTAGS_ALIASES[@]}"; do
    echo "      \"$elt\","
    done
    cat <<EOF
        NULL
    };
    parserDefinition* const def = parserNew ("${CTAGS_LANG}");
EOF
    if [[ -n "${CTAGS_DISABLE}" ]]; then
    enabled=FALSE
    else
    enabled=TRUE
    fi
    cat<<EOF
    def->enabled = ${enabled};
EOF
    cat<<EOF
    def->extensions = extensions;
    def->patterns = patterns;
    def->aliases = aliases;
    def->initialize = install${CTAGS_LANG}Regex;
    def->method     = METHOD_NOT_CRAFTED|METHOD_REGEX;
    return def;
}

/* vi:set tabstop=4 shiftwidth=4: */
EOF
}

parse_cmdline "$@"
emit
[yamato@x201]~/var/ctags-github/misc% 

@masatake
Member

The weakness of this approach is that bash is needed to build ctags.
But I think this is acceptable.
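
Separately, in the generated code above the `"` characters of the R patterns are copied into the C string literals as they are, so those addTagRegex lines would probably not compile. A minimal sketch of an escaping step, assuming a hypothetical helper c_escape that is not part of ctagsc as posted; backslashes are left alone because parse_regex already doubles them:

```bash
# c_escape: hypothetical helper, not part of ctagsc as posted.
# Put a backslash in front of every double quote so that the text
# can be placed inside a C string literal.
c_escape()
{
    printf '%s\n' "$1" | sed 's/"/\\"/g'
}

c_escape '^"?([.A-Za-z][.A-Za-z0-9_]*)"?'
# prints ^\"?([.A-Za-z][.A-Za-z0-9_]*)\"?
```

In emit, the regex, name and kind strings could each be passed through such a helper before they are wrapped in quotes.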

@vhda
Contributor Author

vhda commented Aug 28, 2015

As you may have noticed, I've done little more than comment on some PRs and issues.
I still haven't even completed the Makefile help target that you asked for.
For this reason I don't want to accept new responsibilities until I complete everything else.

@masatake
Member

For this reason I don't want to accept new responsibilities until I complete everything else.

I see. I will add one more comment. Please take a look at it.

@masatake
Member

... I found that there is an R parser implementation in Geany.
We should merge the two R parser implementations.

@vhda
Contributor Author

vhda commented Aug 28, 2015

Makes sense. I'll try to find some time over the weekend.

@masatake
Member

Thank you. There is no need to hurry.

vhda added 2 commits September 2, 2015 21:50
geany/geany@c67caf0
Originally authored by Ascher Stefan <stievie@utanet.at>
@vhda
Contributor Author

vhda commented Sep 2, 2015

Basic R support should be working now.
I decided to re-indent Geany's code to match our rules, even though this will make it harder to share updates between the two repositories.

@masatake
Member

masatake commented Sep 3, 2015

@vhda, thank you for taking the time.
I guess you know the R language. (I don't.)

What do you think about the R_REGEX ifdef condition?
See #308. We should expect that a regex library is always available,
so fallback code for environments without a regex library is not needed.
Our only concern is the quality of the generated tags.

So the question is which is better: the crafted parser or the regex parser?
I would like to maintain only the better one.

If the two implementations complement each other, let's use both, but then #else and METHOD_NOT_CRAFTED are not needed.

(One of the biggest surprises when I started working on ctags was that multiple implementations
for the same language can live together. This also applies to xcmd.)

@vhda
Contributor Author

vhda commented Sep 3, 2015

I know nothing about R, other than some of the examples I searched for while updating this parser.

At this point there aren't many differences between the regex and line parsers, except that the line parser is able to find function definitions that cover more than one line. Nevertheless, I think that a line parser is probably faster and more extensible than a regex parser, so I've kept the first.
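
The multi-line case can be reproduced without ctags. A minimal sketch, assuming each --regex-R pattern is applied to one input line at a time and using grep -E as a stand-in for the ctags regex engine; [ \t] is written as [[:space:]], and the names pattern, count_matches, one_line and split_def are made up for this illustration:

```bash
# The function pattern from the R optlib, adapted for grep -E.
pattern='^[[:space:]]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[[:space:]]*<-[[:space:]]function[[:space:]]*\('

# Print how many lines of $1 match the pattern.
count_matches()
{
    # grep exits with status 1 when nothing matches; "|| true" keeps that from
    # being treated as a failure.
    printf '%s\n' "$1" | grep -E -c -- "$pattern" || true
}

one_line='f <- function(x) {
    x
}'

split_def='g <-
    function(x) {
    x
}'

count_matches "$one_line"     # prints 1: name and "function" share a line
count_matches "$split_def"    # prints 0: the definition is split over two lines
```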

@masatake
Member

masatake commented Sep 3, 2015

@vhda, I see. Thank you. LGTM. I have one more request. Could you add a test case for the `l' kind?
Then please merge the commits.

@vhda
Contributor Author

vhda commented Sep 3, 2015

Hmm... something is fishy here.
The "extended" test case includes both a library and a source call that properly appear in the output of ctags:

> ./ctags -o - Units/parser-r.r/r-extended.d/input.r
.First  Units/parser-r.r/r-extended.d/input.r   /^.First <- function() {$/;"    f
MASS    Units/parser-r.r/r-extended.d/input.r   /^  library(MASS)                         # attach a package$/;"    l
file.path(Sys.getenv("HOME" Units/parser-r.r/r-extended.d/input.r   /^  source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))$/;" s

But expected.ctags is incorrect and make units is still passing.

@masatake
Member

masatake commented Sep 3, 2015

I tried it. Something is wrong. Maybe it is a bug in units. I will fix it.

@masatake
Member

masatake commented Sep 3, 2015

expected.ctags should be expected.tags.

@masatake
Member

masatake commented Sep 3, 2015

If units cannot find the expected.tags file, it reports "passed".

@masatake
Member

masatake commented Sep 3, 2015

file.path(Sys.getenv("HOME" Units/parser-r.r/r-extended.d/input.r   /^  source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))$/;" s

Interesting.

The output is helpful for a user who wants to understand the code.
However, I wonder what other users think about this tag.

I use ctags for understanding the target code, so I myself love this tag.

@vhda
Contributor Author

vhda commented Sep 3, 2015

My bad... but could I ask you to modify units output to include something like:

Testing r-extended as R                                     passed ("expected.tags" not found)
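
A minimal sketch of such a check, assuming a hypothetical helper check_unit that is not the real units code; it takes a test directory and a file holding the actual ctags output:

```bash
# check_unit: hypothetical helper, not the real units code.
# $1 is a test directory such as Units/parser-r.r/r-extended.d,
# $2 is a file holding the actual ctags output for that test.
check_unit()
{
    if [ ! -f "$1/expected.tags" ]; then
        # Nothing to compare against: say so instead of a bare "passed".
        echo 'passed ("expected.tags" not found)'
        return 0
    fi
    if cmp -s "$1/expected.tags" "$2"; then
        echo "passed"
    else
        echo "failed"
        return 1
    fi
}
```

With this, a misnamed file such as expected.ctags would show up in the test log instead of going unnoticed.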

vhda added 2 commits September 3, 2015 11:22
Changes are based on this StackOverflow question:
http://stackoverflow.com/questions/32206608/ctags-and-r-regex

* Add two new types: global variables and function variables.
* Require a '(' after the function in order to guarantee distinction
  between variable and function declarations.
* Add description to all tag types.
* Add test cases.
Add support for global and function variables to match functionality of
the regex parser.
@masatake
Member

masatake commented Sep 3, 2015

My bad... but could I ask you to modify units output to include something like:

I see. Good idea. I should do so.

@masatake
Member

masatake commented Sep 3, 2015

LGTM.

vhda added a commit that referenced this pull request Sep 3, 2015
R: Add regular expression R parser
@vhda vhda merged commit abf7d43 into universal-ctags:master Sep 3, 2015
@vhda vhda deleted the r/create branch September 3, 2015 10:41
@petobens

petobens commented Sep 9, 2015

Hi @vhda and @masatake, thanks for adding R support. However, I have one problem. If I create a foo.R file containing:

x <- 2

Then add to my ctags file:

--langdef=R
--langmap=R:.r.R
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function[ \t]*\(/\1/f,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$/\1/g,Global_Variables/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$/\1/v,Function_Variables/

and finally run ctags -R foo.R, I get a tags file containing:

!_TAG_FILE_FORMAT   2   /extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED   1   /0=unsorted, 1=sorted, 2=foldcase/
!_TAG_PROGRAM_AUTHOR    Universal Ctags Team    //
!_TAG_PROGRAM_NAME  Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL   https://ctags.io/   /official site/
!_TAG_PROGRAM_VERSION   Development //
x   foo.R   /^x <- 2$/;"    g

On the other hand, if I remove the regex from my ctags file (in order to use the new built-in R support) and then run ctags -R foo.R again, I now get:

!_TAG_FILE_FORMAT   2   /extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED   1   /0=unsorted, 1=sorted, 2=foldcase/
!_TAG_PROGRAM_AUTHOR    Universal Ctags Team    //
!_TAG_PROGRAM_NAME  Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL   https://ctags.io/   /official site/
!_TAG_PROGRAM_VERSION   Development //

Is that the expected behavior?

If I now rename my foo.R file to foo.r and run ctags -R foo.r the tags are correctly generated and the behavior seems fixed.

@vhda
Contributor Author

vhda commented Sep 9, 2015

@petobens apparently I forgot to include .R as a valid R file extension.
I'll quickly fix that for you.
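
To illustrate why foo.r worked and foo.R did not, here is a minimal sketch, assuming the extension comparison is case-sensitive (which the foo.r result above suggests). lang_for is a hypothetical stand-in for the extension lookup, not ctags code:

```bash
# lang_for: hypothetical stand-in for the extension lookup, not ctags code.
# $1 is a file name; the remaining arguments are the extensions mapped to R.
lang_for()
{
    file=$1
    shift
    for ext in "$@"; do
        case "$file" in
            *."$ext")
                echo R
                return 0
                ;;
        esac
    done
    echo none
}

lang_for foo.R r      # prints none: only "r" is mapped
lang_for foo.R r R    # prints R: both "r" and "R" are mapped
```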

@vhda
Contributor Author

vhda commented Sep 9, 2015

Fixed in #553

@petobens

petobens commented Sep 9, 2015

@vhda awesome thank you very much!

I have one last question. Suppose I have two files:
i) bar.R:

bar <- function (x) {
    return(x)
}

and ii) foo.R

source("bar.R")
print(bar(2))

Ideally I would like to jump from bar(2) in foo.R to the bar function definition in bar.R. Can I somehow achieve this with ctags? I believe that what I'm looking for is to generate tags for included files (i.e., once tags are generated for foo.R, also generate tags for bar.R, since bar.R is included with source).
Once again thanks in advance for the help and for the previous quick fix.

@vhda
Contributor Author

vhda commented Sep 9, 2015

That really depends on the editor you are using, but in general you don't need to follow the "include" statement, you simply search for the tag "bar".
Vim, for example, supports both methods. Pressing ctrl-] jumps to the tag declaration, and [ ctrl-i follows the includes (if properly configured). Other editors might have similar or completely different behaviours.

@petobens

petobens commented Sep 9, 2015

@vhda I'm using Vim. I will try what you suggested. Thanks.

masatake referenced this pull request in vhda/ctags Oct 6, 2015
* Use $BASH_VERSION to check if current shell is bash.
* Default shell is now /bin/sh (it would not work if it was non SH POSIX
  shell, like /bin/tcsh).
* Typo fix.