Some thoughts about a test framework for PIPS

JanWielemaker · November 23, 2024, 5:57pm

I had a little look into testing. I think there are two basic approaches

XSB: a test is a Prolog file with some main goal. The test driver (a shell script) runs Prolog, loads the file and runs the main goal. That prints output that is compared with the correct output.
The others have some Prolog way to specify the code to run and describe how it should perform. SICStus and SWI-Prolog use SWI-Prolog’s PlUnit (although the driver in SWI-Prolog has been mostly rewritten recently). Logtalk’s lgtunit is inspired by this as well. Ciao fits tests in its assertion language. That looks very different, but the functionality seems comparable. ECLiPSe has .tst files that contain goals and their expected output.

I dislike (sorry) XSB’s solution for several reasons. Two stand out: textual equality testing of Prolog terms is flaky and the framework depends on Unix tools.
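
As an illustration of the second style, here is a minimal PlUnit sketch, assuming SWI-Prolog's library(plunit). The unit name functor_examples and the test names are made up for this example. The point it tries to show is that outcomes are checked on terms (==, =@=, error terms) rather than on printed text, so differences in how a system prints variables or operators do not matter. Note that =@= (variant check) is an SWI-Prolog built-in and is an assumption here; other systems may spell it differently.

```prolog
:- use_module(library(plunit)).

:- begin_tests(functor_examples).

% The goal is expected to fail.
test(name_mismatch, [fail]) :-
    functor(foo(a), fo, 1).

% The goal must succeed and the bindings must satisfy the check.
test(decompose, [true(Name-Arity == foo-3)]) :-
    functor(foo(a, b, c), Name, Arity).

% The goal must raise an error matching the given term.
test(unbound, [throws(error(instantiation_error, _))]) :-
    functor(_, _, 3).

% Fresh variables print differently on each system (and each run),
% but a variant check on the term itself is stable.
test(fresh_vars, [true(T =@= f(_, _))]) :-
    functor(T, f, 2).

:- end_tests(functor_examples).

% Usage: ?- run_tests(functor_examples).
```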

I think we roughly have three options

Adopt one of the existing frameworks and let each Prolog implement their own test driver.
Define a new one and again, let each Prolog implement their own test driver.
Adopt Logtalk for running PIP tests.

The main advantage of the first two approaches is that the PIP tests can run natively in the test infrastructure of each system. The main advantage of using Logtalk is that it provides a portable solution to both the tests and support predicates you may need to run the tests.

FYI. SWI-Prolog has tests from ECLiPSe (strings) and XSB (tabling). The ECLiPSe tests use a modified version of @jschimpf’s test_util_iso.pl. The XSB tests basically reimplement the XSB test driver logic in Prolog and use rewrite tricks to turn each .P test file into a SWI-Prolog unit test. That was a lot of work, but then there were a lot of tests to be reused.

pmoura · November 23, 2024, 7:01pm

For reference on lgtunit:

Summary: Tools - Logtalk
Documentation: lgtunit — The Logtalk Handbook v3.85.0 documentation
Testing automation script: logtalk_tester man page
Test reports automation script: logtalk_allure_report man page

As lgtunit tool supports multiple test dialects, running existing tests without modifying them should be possible. See a simple example at:

There’s also mature support for property-based testing. See the blog post in the “testing” category.

pmoura · November 23, 2024, 7:06pm

I am trying to reply, but the forum software prevents me from writing posts with more than two links, tells me that I cannot post links to my website, and classifies my posts as spam. Can someone fix the forum settings? Thanks.

jfmc · November 23, 2024, 8:02pm

Hi @JanWielemaker, I think that to avoid a chicken-and-egg situation PIPs should not impose any testing framework. My vote goes for:

Let the PIP authors decide what is the most effective way to specify tests and specifications (natural language, pseudo-code, reference implementations, etc.)
It could be recommended (but not mandatory) to write a reference (executable) test suite, that will help implementation and checking for conformance.
Allow any existing test framework (as long as the semantics are clear)

A unified test framework or test framework compatibility would be another PIP itself.
Maybe supplementary PIP material like reference implementations can be added as PIP revisions.

Theresa_Swift · November 23, 2024, 9:20pm

First of all, kudos to Jan – I know he did some hard work to translate a large portion of the XSB tests into his own form. I don’t fully agree with his opinion, but his opinion is not based on ignorance.

I do agree with Jan about the difficulty of running the XSB test suite on non-Unix systems (Windows). David, who uses Windows, always has Cygwin installed so that he can run the XSB test suite.

That said, I think the XSB test framework has some nice points.

First, it is lightweight in terms of system state. Many tests change the system state by modifying dynamic code, creating or abolishing tables and so on. The XSB framework allows one to run as many tests in a test file as needed, but to obtain a fresh state by making a new test file.

This often makes it easy to pinpoint system issues. If we implement some shaky new feature that causes unexpected problems like a core dump, only a few tests are affected.

As a result, the framework has proved useful to us. There are about 900 test files, and most of the test files perform multiple tests, so there are likely several thousand tests. As an aside, the Ergo test suite has about 600 test files, again usually with several tests per file.

Also, the XSB test framework should be simple for other Prologs to run: only a couple of environment variables need to be changed.

But maybe I’m not understanding some of the difficulties that Jan faced when he ported XSB’s tabling tests. I think we should discuss testing in one of our Thursday meetings (after we’ve reviewed Joachim’s PIP proposal). My current opinion is that if we have multiple test frameworks that are easily portable to multiple systems, we should allow their use.

jschimpf · November 23, 2024, 10:08pm

As mentioned by @JanWielemaker, there is a simple pure ISO test suite (harness and 950 tests) I did in 2013.

The test harness is just 200 lines of plain ISO Prolog. Test patterns are quick to write and look like

functor(foo(a), fo, 1)	    should_fail.
functor(foo(a, b, c), X, Y)	should_give X==foo, Y==3.
functor(1, X, Y)		    should_give X==1, Y==0.
functor([_|_], '.', 2)	    should_give true.
functor(X, Y, 3)		    should_throw error(instantiation_error, _).
functor(X, foo, a)		    should_throw error(type_error(integer, a), _).

Output is simply

...
----- Finished tests from file iso.tst
953 tests found.
895 tests succeeded.
51 tests failed.
7 tests skipped.
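
To make the pattern format above more concrete, here is a minimal sketch of how such a harness could interpret the three pattern kinds. It is not the actual test_util_iso.pl: the operator priorities and the predicate names run_test/2 and run_patterns/3 are assumptions made for this illustration, and it relies on subsumes_term/2 from ISO Cor.2 being available.

```prolog
% Hypothetical operator declarations; the real harness may use
% different priorities. They must be above 1000 so that a
% conjunction such as X==foo, Y==3 can appear as the right argument.
:- op(1110, xf,  should_fail).
:- op(1110, xfx, should_give).
:- op(1110, xfx, should_throw).

% run_test(+Pattern, -Result): Result is pass or fail.

% Goal must fail; an exception counts as a test failure.
run_test((Goal should_fail), Result) :-
    (   catch(\+ Goal, _, fail)
    ->  Result = pass
    ;   Result = fail
    ).
% The first solution of Goal must satisfy Check. The double
% negation keeps bindings from leaking out of the test.
run_test((Goal should_give Check), Result) :-
    (   catch(\+ \+ (once(Goal), Check), _, fail)
    ->  Result = pass
    ;   Result = fail
    ).
% Goal must throw a ball that is an instance of Expected.
run_test((Goal should_throw Expected), Result) :-
    (   catch((Goal, fail), Ball, true),
        subsumes_term(Expected, Ball)
    ->  Result = pass
    ;   Result = fail
    ).

% run_patterns(+Patterns, -Passed, -Failed): count the results.
run_patterns([], 0, 0).
run_patterns([Pattern|Patterns], Passed, Failed) :-
    run_test(Pattern, Result),
    run_patterns(Patterns, Passed0, Failed0),
    (   Result == pass
    ->  Passed is Passed0 + 1, Failed = Failed0
    ;   Passed = Passed0, Failed is Failed0 + 1
    ).

% Usage:
% ?- run_patterns([ (functor(foo(a), fo, 1) should_fail),
%                   (functor(foo(a, b, c), X, Y) should_give X==foo, Y==3),
%                   (functor(_, _, 3) should_throw error(instantiation_error, _))
%                 ], Passed, Failed).
```

A real harness would read the patterns from a .tst file with read/1 and also handle skipped tests; this sketch only covers the pass/fail decision for a list of patterns.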

pmoura · November 23, 2024, 10:12pm

Needless to say, lgtunit doesn’t require any porting and works as-is with all systems participating here (and several more). The Prolog standards compliance suite flags multiple bugs in most of these systems. It also includes test sets for Prolog features such as Unicode (wip) and unbounded integer arithmetic. Happy to add tests for other features such as tabling. Altogether, the current Logtalk distribution includes more than 10k tests. Automation support is provided for both POSIX and Windows systems. Report generation is included.

I don’t see a strong requirement to select a single testing framework. I will, however, be happy if people here actually run those tests and start fixing the exposed bugs, notably those in core features. That would be meaningful progress. It is really simple. On POSIX:

$ cd logtalk
$ logtalk_tester -p eclipse    # or ciao, or xsb, or swi, or...

On Windows:

PS > cd Logtalk
PS > logtalk_tester.ps1 -p eclipse    # or ...

Do you want a nice report?

$ logtalk_tester -p eclipse -f xunit
$ logtalk_allure_report
$ allure open

Similar on Windows. Just add the .ps1 extension.

jfmc · November 24, 2024, 8:14am

Just to contribute to this discussion, the Ciao test framework, which is based on assertions, was also used to encode ISO Prolog conformance tests: GitHub - ciao-lang/iso_tests: Tests for ISO Prolog conformance (1047 test cases).

We’d also like to run logtalk_tester regularly (or other test suites easy to port) for checking compliance too. This will surely catch more errors, wrongly encoded test cases, etc.

pmoura · November 24, 2024, 10:07am

Example with ECLiPSe by running just the Prolog standards compliance suite (note the number of tests):

$ cd ~/logtalk/tests/prolog 
$ logtalk_tester -p eclipse -w
% Batch testing started @ 2024-11-23 22:17:59
%         Logtalk version: 3.86.0-b01
%         ECLiPSe version: 7.0.57
% ...
% 192 test sets: 179 completed, 13 skipped, 0 broken, 0 timedout, 0 crashed
% 3509 tests: 156 skipped, 2868 passed, 485 failed (0 flaky)
%
% Batch testing ended @ 2024-11-23 22:29:42

Incomplete implementation of the format/2-3 predicates accounts for 282 test failures (tests are repeated with chars, codes, and atom). There are also some tests that are not consensual (with different Prolog implementers complaining about different ones…). But there are also exposed bugs in core built-in predicates such as unify_with_occurs_check/2 and missing de facto standard arithmetic functions (e.g. hyperbolic functions).

My general advice is to first fix the test failures that you recognize as bugs and leave discussion of the tests you disagree with for later.