Okay so I'm working on a language interpreter to settle a bet with a friend. Unfortunately, right now the tokenizer I've writen is taxing the processor. Originally I had it set up to all be handled by the regex engine. It used almost no processing time, but it was dropping important characters, and seemed somewhat unpredicable (in the sense that it wouldn't always return the longest match). Any ideas (short of rewrite it in C with lex and yacc) would be appreciated.
Code:
;original
(defun tokenizer (str re-list &optional (output nil))
(let ((target (string-trim '(#\Newline #\Tab #\Space) str)))
(if (string= "" target)
output
(let
((pos
(car (sort (delete-if-not
(lambda (x) (and (numberp (car x)) (numberp (cadr x))))
(mapcar
(lambda (re)
(multiple-value-bind (start end)
(cl-ppcre:scan re target)
(list start end))) re-list))
(lambda (x y)
(cond ((< (car x) (car y)) t)
((and (= (car x) (car y)) (> (cadr x) (cadr y))) t)
(t nil)))))))
(if (null pos)
(append1 output target)
(tokenizer
(subseq target (cadr pos)) re-list
(if (> (car pos) 0)
(nconc output (list (subseq target 0 (car pos))
(subseq target (car pos) (cadr pos))))
(append1 output (subseq target (car pos) (cadr pos))))))))))
EDIT: Got some help from the kind folks over at irc.freenode.net#lisp (second code snippet reflects some of their suggestions), and it now uses an acceptable amount of processing power (memory usage might need som work) but any other ideas would be appreciated.
Code:
;post suggestions from #lisp
(defun tokenizer (str re-list &optional (output nil))
;re-list is now a precompiled list of re's instead of text strings
;that would be compiled by cl-ppcre:scan
(let ((target (string-trim '(#\Newline #\Tab #\Space) str)))
(if (string= "" target)
output
(let
((pos
;reduce replaces the expensive calls to sort
(reduce (lambda (x y)
(cond ((< (car x) (car y)) x)
((and (= (car x) (car y)) (> (cadr x) (cadr y))) x)
(t y)))
(delete-if-not
(lambda (x) (and (numberp (car x)) (numberp (cadr x))))
(mapcar
(lambda (re)
(multiple-value-bind (start end)
(cl-ppcre:scan re target)
(list start end))) re-list)))))
(if (null pos)
(append1 output target)
(tokenizer
(subseq target (cadr pos)) re-list
(if (> (car pos) 0)
(nconc output (list (subseq target 0 (car pos))
(subseq target (car pos) (cadr pos))))
(append1 output (subseq target (car pos) (cadr pos))))))))))[