DWDS/Dialing Concordance Code

a collection of indexing and search tools for corpus linguists

Brought to you by: garabik, kzimmer, mukau, sokirko

##-*- Mode: Change-Log; coding: utf-8; -*-

v2.0.7 Wed, 27 Mar 2013 16:52:54 +0100 moocow
	+ fixed segfault bug in CHitBorders::GetPageNumber() when requesting page-break number 4294967295
	  i.e. 0xffffffff, i.e. ddc constant UnknownPageNumber
	+ problem occurred in a dwds kerncorpus test set; symptoms:
	  - initial page as declared by ddc opt-file 'page' bibl field wasn't getting read properly
	  - page counter was getting 'inherited' across various files
	+ ddc_simple comfort changes
	  - options are now case-insensitive
	  - added handy aliases -h, --help, -json, -table, -text
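
The entry above does not say how the GetPageNumber() fix was implemented. As a minimal sketch only, one common way to avoid this class of crash is to check for the sentinel before using the value as an index. `page_for_break` and the `pages` vector are hypothetical names for illustration, not the real CHitBorders interface; only the constant's value is taken from the entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Value taken from the changelog entry: 4294967295 == 0xffffffff.
const std::uint32_t UnknownPageNumber = 0xffffffffu;

// Hypothetical lookup (NOT the real CHitBorders::GetPageNumber()):
// returns the page stored for a page-break number, or UnknownPageNumber
// if the break number is the sentinel itself or lies outside the table.
std::uint32_t page_for_break(const std::vector<std::uint32_t>& pages,
                             std::uint32_t break_no) {
    if (break_no == UnknownPageNumber || break_no >= pages.size()) {
        return UnknownPageNumber;  // never index with the sentinel
    }
    return pages[break_no];
}
```

The range check already covers the sentinel for any realistic table size; the explicit comparison is kept so the intent is visible to readers.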

v2.0.6 2013-03-21 14:05:24 +0100 moocow
	+ added generic wildcard operator '*'
	  - uses stupid all-values expansion (like /.?/), but no need for regex-based evaluation, so slightly faster
	+ added #=, #<, and #> phrase-query distance operators

v2.0.5 Fri, 04 Jan 2013 15:58:35 +0100 moocow
	+ re-factored bibliographic filtering in Bibliography.(h|cpp), QueryFilter.(h|cpp)
	+ added new CConcXml member slot m_RegexOpts : initialized from CConcIndexator at option-load time (ConcordOptions.cpp)
	+ added new FreeBiblIndex member m_pRegexOpts : pointer to m_RegexOpts for parent CConcXml object (utf8 by default)
	+ class TxCab: added "m_MapMode" argument
	+ added class TxCabMap (default m_MapMode=1)
	+ removed implicit append of "&qd=" for TxCab and descendants: now only included if argument URL doesn't end in '=',
	  to allow more flexible URL specifications
	+ added CBiblExpander.(h|cpp): external bibliographic pseudo-fields
	  - implementation currently just wraps CTermExpander classes (except for Chain)

v2.0.4 Mon, 10 Dec 2012 15:32:28 +0100 moocow
	+ fixed annoying queries-must-end-with-whitespace bug in ddc_simple.cpp
	+ re-worked "#HAS_FIELD" parsing, compilation, and evaluation routines
	+ added support for negated #has filters "!#has[...]"
	+ added support for negated regexes in #has expressions ("#has[x,!/r/]" acts like "!#has[x,/r/]")
	+ added safe escapes for "*" wildcards in #has expressions
	+ added explicit set-wise disjunction for #has expressions: #has[x,{a,b,c,...}]
	+ TODO: external expansion API a la "|"-notation for #has filters

v2.0.3 Fri, 28 Sep 2012 09:09:41 +0200 moocow
	+ backwards-compatibility fix: remove index-list from text and html bibl output

v2.0.2 Mon, 16 Jul 2012 13:43:13 +0200 moocow
	+ better handling of startup errors
	+ ddc_daemon now uses an additional sentinel file (aka 'wait-file') to determine when and if the (forked) server process has started up
	+ ugly hack with hard-coded time limit; alternative would be socketpair() or the like

v2.0.1 Fri, 02 Dec 2011 14:56:18 +0100
	+ fixed various static buffer overflows
	+ "real" static buffers can now use global define DDC_STATIC_BUFLEN from ddcConfig.h
	+ added configure argument --with-static-buflen=NBYTES (default=16384)

v2.0.0 Mon, 14 Nov 2011 15:10:07 +0100
	+ message length arguments over sockets now always passed in lsb order (mostly compatible)
	+ better version compatibility checking for stored indices
	+ updated license files COPYING, COPYING.LESSER to LGPL-3.0
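
To illustrate what passing message lengths "in lsb order" means, here is a minimal sketch that serializes a 32-bit length least-significant byte first, regardless of the host's byte order. The function names are hypothetical and are not taken from src/common/string_socket.*; the actual DDC code may differ.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper (not DDC's API): write a 32-bit message length
// least-significant byte first, independent of host endianness.
std::array<unsigned char, 4> encode_len_lsb(std::uint32_t len) {
    return {{
        static_cast<unsigned char>(len & 0xffu),
        static_cast<unsigned char>((len >> 8) & 0xffu),
        static_cast<unsigned char>((len >> 16) & 0xffu),
        static_cast<unsigned char>((len >> 24) & 0xffu),
    }};
}

// Inverse of encode_len_lsb(): rebuild the length from four LSB-first bytes.
std::uint32_t decode_len_lsb(const std::array<unsigned char, 4>& buf) {
    return static_cast<std::uint32_t>(buf[0])
         | (static_cast<std::uint32_t>(buf[1]) << 8)
         | (static_cast<std::uint32_t>(buf[2]) << 16)
         | (static_cast<std::uint32_t>(buf[3]) << 24);
}
```

Because the bytes are produced with shifts rather than by copying the integer's memory, a big-endian and a little-endian host emit the same four bytes for the same length.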

v1.80.dx-1
	2011-10-11 15:13  moocow
	+ added new #random, #random[SEED] query-sort operators
	  - basically works; cache gets in the way of "true" randomization though (workaround: regex hack in wrapper cgi)
	+ added AllowUnsafeQueries option to CConcIndexator: if false (default), file-list queries are disabled

	2011-07-08 21:03  moocow
	+ added ddc_opt.pod : opt-file documentation (largely ganked from old README)
	+ added ddc_query.pod: query syntax

	2011-07-06 13:45  moocow
	+ term expansion chains working with lexer revision (pipeline suffix notation)
	+ added expand_terms request to ddc daemon
	+ moved command-line utilities from camel-case to underscore-separated, e.g. ddcIndex -> ddc_index
	+ got prefix, suffix, infix query types working right for arbitrary indices

	2011-06-30 14:37  moocow
	+ added abstract term expander API in ConcordLib/TermExpander.*
	+ ported old built-in expanders to new API
	+ added external expander protocol CAB (tt/http)
	+ improved server->client error reporting

	Wed, 08 Jun 2011 15:30:55 +0200 moocow
	+ fixed segfaults in rank- and bigram-sort operators
	+ re-defined DWORD, WORD as uint32_t and uint16_t, respectively, for better 32/64-bit compatibility
	+ lexer+parser pair fully re-written

	Thu, 19 May 2011 15:30:09 +0200 moocow
	+ yet another lexer reset fix in ConcordLib/QueryParser.cpp
	+ added single-quoted symbols to query lexer (e.g. 'sapere' @'aude')

	Wed, 18 May 2011 15:46:00 +0200 moocow
	* more lexer+parser hacks
	  + only escape \uXXXX sequences in regexes, since other backslash escapes
	    are probably needed by the regex engine
	  + removed common/json.(h|cpp); moved functions to common/ddcString.(h|cpp)
	* c++-ified utf8 code to common/utf8xx.(h|cpp)
	  + haven't adapted the whole api yet
	* major futzing about in PCRE (regex library) interface in PCRE/pcre_rml.(h|cpp)
	  + index-based regex queries should now respect the CConcIndexator::m_Utf8 flag,
	    if the regexes are passed the struct returned by CConcIndexator::GetRegexOptions()
	    - basically a generalization of the old CConcIndexator::GetRegExpTables() strategy
	  + still some goofs with POSIX character classes (e.g. [:alpha:]) and non-ASCII
	    characters (e.g. 'ä' matches [^[:alpha:]])
	    - this might have to do with bad passing of UTF8 option from the pcrecpp RE_Options
	      struct to the bitmask used by the underlying C pcre code called e.g. from
	      RML_RE::Compile()
	  + legacy table-based (bytewise) RML_RE constructor
	      RML_RE(const string& pat, const vector<BYTE>& RegExpTables)
	    re-implemented, since it's called elsewhere, e.g. by MorphWizardLib/wizard.cpp
	  + still more regex-related sanitizing todo, e.g. for #has_field[] queries

	Tue, 17 May 2011 17:02:32 +0200 moocow
	* re-worked query lexer+parser pair used by ConcordLib/QueryParser.h
	    simple_query.[ly] --> yyQLexer.l, yyQParser.y
	* eliminated useless and confusing sed calls in scanner+parser generation
	* removed stale MyFlexLexer.h from distribution
	* bug fixes in C-style escapes
	  + added json-style "\uXXXX" utf-8 escapes
	  + added common/utf8.[ch]
	  + moved generic string-handling (currently only C escape|unescape) to
	    common/ddcString.(cpp|h)
	* started re-working query lexer+parser pair src/ConcordLib/yyQ(Lexer|Parser).[ly]
	  + moved ugly multi-rule symbol detection to a single pattern for {symbol_text}
	  + allow backslash, C, and json-style escapes with CDecodeString()
	    - requires more cleanup in QueryNode.cpp since some of the operator syntax
	      was checked and removed there (ugly and inflexible)
	* occasional cleanup required in QueryParser.cpp; in particular in yyqlex() method
	  + mostly checks for lexer return value to set YYSTYPE appropriately
	  + this is probably pointless; we should either set YYSTYPE in the lexer
	    or just use _prs->yytext() etc from the parser
	* still need to check various query types:
	  (multi-word strings, &&, ||, near, with, (), #has_whatever, thesaurus, chunk, ...)
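
As an illustration of the json-style "\uXXXX" escapes mentioned in this entry, the following sketch replaces each well-formed escape with the UTF-8 bytes of its code point. It is not the code in common/ddcString.(cpp|h) or CDecodeString(); all function names are hypothetical, other backslash escapes are deliberately left untouched, and surrogate pairs are not combined.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Return the value of a hexadecimal digit, or -1 if `c` is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Append the UTF-8 encoding of a code point in U+0000..U+FFFF to `out`.
void append_utf8(std::uint32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper (NOT DDC's CDecodeString()): decode "\uXXXX" escapes
// and copy every other character, including other backslash escapes, as-is.
std::string decode_u_escapes(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u') {
            std::uint32_t cp = 0;
            bool ok = true;
            for (std::size_t k = i + 2; k < i + 6; ++k) {
                int d = hex_value(in[k]);
                if (d < 0) { ok = false; break; }
                cp = (cp << 4) | static_cast<std::uint32_t>(d);
            }
            if (ok) { append_utf8(cp, out); i += 6; continue; }
        }
        out += in[i++];  // not a complete \uXXXX escape: copy verbatim
    }
    return out;
}
```

Leaving unrelated escapes alone mirrors the later note in this changelog that only \uXXXX sequences are decoded inside regexes, since the regex engine probably needs the other backslash escapes.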

	Mon, 16 May 2011 16:07:30 +0200 moocow
	* added iconv wrapper class common/ddcIconv.h
	* added character set conversion hack for German in LemmatizerLib/Lemmatizers.h
	  + does semi-transparent recoding from user queries in UTF-8 to underlying latin-1
	    morphology data
	  + tried recoding morphology to UTF-8, but this breaks alphabet size (hacked) as
	    well as various ugly hard-coded character set hacks in common/utilit.(h|cpp),
	    in particular the byte-wise property bitmasks in the table ASCII[256] from
	    utilit.cpp ... morphology recoding stuff lives in the ddc-morph 'utf8' branch
	    anyways, but probably should not be used
	* TODO: remove __ALL__ language-dependent code from the DDC core
	  + if really necessary move it to dlopen()able module(s) for better language modularity
	    and potential replacement of the actual morphology used.


	Fri, 13 May 2011 12:22:43 +0200 moocow
	* added corpus filename field ('file') to table output
	* added '-' as alias for stdout for ddcSimple, ddcConsole
	* re-worked filename auto-detection code in utilit.cpp
	* added comments with '#' to .opt file parsing in ConcordOptions.cpp
	* .opt file parsing now accepts C-style escapes \x09, \t, ... for delimiters
	* added ConcIndexator field m_TokenDelimiter : token-initial delimiter
	  + fixes broken token boundary parsing for table, text formats
	  + parsed from .opt file as 'TokenDelimiter' (default=empty: none)
	* added Utf8 (m_Utf8) flag to .opt file, (class ConcIndexator)
	  + boolean: whether to assume corpus data is utf8-encoded
	  + currently only affects json output mode

	Wed, 11 May 2011 21:18:36 +0200 moocow
	* forked sources from sourceforge CVS
	  + moved CVS sub-project directories to src/ (formerly Source/):
	    CVSROOT=ddc-concordance.cvs.sourceforge.net:/cvsroot/ddc-concordance/SUBDIR -> Source/SUBDIR
	* converted build system from legacy Makefiles to autoconf+automake
	* factored out everything under old Dicts/ directory into package ddc-morph
	* install all built libraries to PREFIX/lib
	* install all headers from src/ to $RML/include/
	  + $RML/include mirrors src/ substructure so as not to break internal #includes
	* renamed runtime directories to lower-case following usual UNIX conventions:
	  $RML/Bin -> $RML/bin
	  $RML/Docs -> $RML/doc
	  $RML/Logs -> $RML/log
	  $RML/Dicts -> $RML/dict
	  Source/ -> src
	* moved runtime configuration files from $RML/bin/ to $RML/etc/
	  $RML/etc/rml.ini
	  $RML/etc/ddc_local_corpora.cfg
	  $RML/etc/ddc_server.cfg
	  $RML/etc/ddc_xml_server.cfg
	* added shared prefix 'ddc' to all runtime executables in $RML/bin:
	  ConcordIndex -> ddcIndex
	  ConcordConsole -> ddcConsole
	  ConcordDaemon -> ddcDaemon
	  ConcordSimple -> ddcSimple
	  Search -> ddcSearch
	  FileLem -> ddcFileLem
	  MorphGen -> ddcMorphGen
	  StructDictLoader -> ddcStructDictLoader
	* changed default daily log-file name for ddcDaemon to use strftime "%F"
	  format (ISO 8601):
	  $RML/log/concord/YYYY-MM-DD.log
	* changed integer type sent over network sockets for message lengths from size_t
	  to uint32_t (a la C99 stdint.h) in src/common/string_socket.*
	  + this should fix protocol ambiguities between 32- and 64-bit systems
	  + still doesn't solve big-/little-endian ambiguity
	    - TODO: add byte-order detection & twiddling code to handle this
	* added README.pod (-> README.txt, README.html)
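
The daily log-file naming described in this entry can be sketched with the standard strftime "%F" conversion, which is equivalent to "%Y-%m-%d". This is an illustration under assumptions, not ddcDaemon's actual code: `daily_log_name` is a hypothetical helper, and the directory is passed in by the caller rather than derived from $RML.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper (not DDC's actual code): build "<dir>/YYYY-MM-DD.log"
// from a broken-down time. Returns an empty string if formatting fails.
std::string daily_log_name(const std::string& dir, const std::tm& when) {
    char date[16];
    // "%F" is the ISO 8601 date format, equivalent to "%Y-%m-%d".
    if (std::strftime(date, sizeof date, "%F", &when) == 0) {
        return std::string();
    }
    return dir + "/" + date + ".log";
}
```

A caller would typically obtain `when` from std::localtime() on the current time; ISO 8601 names also sort chronologically in a plain directory listing.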

##-- for older changes, see doc/DDC_ChangeLog.txt