# PLAN
This branch changes the pipeline to insert the syntax CG right after
disam and capstag (before the tagger), which gives us syntactic @-tags on
all lexical units. Then, after bidix and lexical selection, the new step
refsyn.t1x removes the @-tags, but also keeps track of the last seen @subj
and puts it in the ref field of the following words. Then t1x reads
the ref field and uses it to compute the gender/number of past
participles (pp) created from passives.
- See `refsyn.t1x`, which stores `cur_subj` and places it in the ref
field. It also removes the syntactic function tags so tNx doesn't have
to deal with them; using syntax tags in transfer will have to be a
later extension.
We are holding off on apertium-anaphora for now. What we want is to put
the governed @subj, which is typically the *nearest* one, into
`<clip side="ref">` for transfer to use. It's syntactic, not anaphoric.
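
As a rough illustration of the idea (not the actual `refsyn.t1x` rules), the subject tracking can be sketched in Python. The `LexicalUnit` class, the `AGREEMENT_TAGS` set and the `refsyn` function are hypothetical stand-ins; the real step works on the Apertium stream format. The sketch also assumes that a subject only affects the words *after* it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: only gender/number tags of the subject are worth remembering.
AGREEMENT_TAGS = {"m", "f", "mf", "nt", "sg", "pl"}


@dataclass
class LexicalUnit:
    lemma: str
    tags: List[str]
    ref: Optional[List[str]] = None  # hypothetical stand-in for the ref field


def refsyn(units):
    """Strip @-tags and copy the last seen @subj agreement into `ref`."""
    cur_subj = None
    out = []
    for lu in units:
        # Drop syntactic function tags so later stages never see them.
        plain = [t for t in lu.tags if not t.startswith("@")]
        # The ref is assigned before cur_subj is updated, so a subject
        # only influences the words that follow it.
        out.append(LexicalUnit(lu.lemma, plain, cur_subj))
        if "@subj" in lu.tags:
            cur_subj = [t for t in plain if t in AGREEMENT_TAGS]
    return out
```

With a subject tagged `n.f.sg.def.@subj` followed by a verb and a participle, both following units end up with `ref == ["f", "sg"]`, and no unit keeps its @-tag.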
## Pretty much done:
- lrx needs to deal with @-tags (typically does, but some rules might
end in <aa>)
- lsx needs to deal with @-tags
- Explicit <s n="aa"/><d/> in lsx needs to be turned into <par n="d"/>
(or <par n="d:"/>) which can skip the function tags; try
`xmllint --xpath '//l/s[@n="aa"]/../../l/d/../..' apertium-nno-nob.nob-nno.lsx`
- passive gender/number now uses the `ref` field, via refsyn.t1x
- In some cases we need to use a subject that's to the right;
refsyn.t1x needs some rules for that
- Syntax CG now before tagger, using syntax for disambiguation.
- Most lrx/lsx regressions fixed.
- Use the subj→ref method for participles as well as passives. The old
method was to "disambiguate" participles based on preceding subject,
but that fails when subject changes gender in bidix, since we
disambiguate based on nob gender. OTOH it's the target language
subject that gets stored in the `ref` field, which *will* have the
right gender. T1X now uses the `ref` field if it's set, falling back
to the input gender/number (given by the old method) if unset.
- Subjects of relative clauses and subclauses tagged as @xubj
- de<pl><@subj> nevner mannen<mf><@xubj> i 50-åra som …
- et antall<sg><@subj> av deres krigere<pl><@xubj> …
- Regular (non-pp) adjectives
- restricted to the cases where the subject is plural, or it changes
in translation between nt and non-nt genders
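
The "use `ref` if set, otherwise fall back to the input gender/number" behaviour described above can be sketched as follows. This is a minimal Python illustration, not the actual T1X rules; the `agreement` function and the tag tuples are hypothetical names, and the sketch assumes the fallback applies to the ref as a whole rather than per feature.

```python
GENDERS = ("m", "f", "mf", "nt")
NUMBERS = ("sg", "pl")


def pick(tags, options):
    """Return the first tag that is one of `options`, or None."""
    return next((t for t in tags if t in options), None)


def agreement(own_tags, ref):
    """Choose (gender, number) for a participle or adjective.

    If the ref field is set, it wins; otherwise we fall back to the
    gender/number that the word itself arrived with (the old method).
    """
    source = ref if ref else own_tags
    return pick(source, GENDERS), pick(source, NUMBERS)
```

For example, `agreement(["adj", "pp", "m", "sg"], ["nt", "sg"])` takes neuter singular from the ref, while the same call with `ref=None` keeps the masculine singular from the input.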
## Main work remaining:
- Remaining regressions in passive genders (missing refsyn.t1x
patterns, bad syntax disambiguation?)
- We should not remove marked plurals! Our tagging doesn't show if
e.g. utelatt<adj><pp><pl> was the ambiguous form "utelatt" or the
plural-only "utelatte", so T1X might turn that into mf.sg.ind to
match the @subj, but if it actually was unambiguously plural in nob
we should keep it that way (e.g. when adj is used as noun in
"rapportering av utelatte").
- Sometimes the subject is a whole clause – should give nt:
- [At disse<pl> ble solgt] er fint<nt!>
- [Hva de mistenkte<pl> skal ha gjort], er ikke kjent<nt!>.
May require "outer" ref, flip on clause boundary.
But we can't always trust that som+comma ends the clause:
- Han stadfestar at<nt> alle tenestemennene<pl> som avfyrte skot, no
er avhøyrde<pl>
- Coordination should give plural
- To politivakter og en vakt fra et privat vaktselskap ble drept<pl!>
- Et fransk kjærestepar sier til VG at det var de som fant norske Maren Ueland og danske Louisa Vesterager Jespersen drept i Marokko mandag morgen.
But sometimes it's hard to know if it's the same person (sg) or different people (pl):
- Programleder og tidligere gjengleder skutt og drept i København.
- Missing relative pronouns are difficult:
- Jeg er her på grunn av beslutninger<pl> tatt<pl> på europeiske møter
- Brasil<nt> er hardt truffet med vel 55.000 dødsfall<pl> relatert<pl> til pandemien
- … stemte med mannens fingeravtrykk<nt> funnet<nt> hos …
Difficult, since e.g.: ble det ved flere anledninger<pl> tatt<nt> opp i Stortinget
- Maybe we can use ngram frequencies f(ta beslutninger) >>> f(ta anledninger)
- Current solution: We treat sequences `n.ind pp.@o-pred` as
agreeing. So a participle tagged as object predicate will agree in
this manner, but this fails in e.g. «med dødsfall relatert til»
where «dødsfall» is tagged `@←p-utfyll` and so «relatert» is just
tagged `@adv` (maybe syn.rlx could have an `@a-pred`, but this is
in general a difficult attachment problem)
- adj.sg.nt.@adv should be adv_movable
- domineres sterkt → er sterkt dominert
- Crossing subjects are difficult:
1. Samtidig skal vi ha respekt for den politiske plattformen<xubj> som de fire partiene<subj> fremforhandlet på Granavollen, og som ble godkjent<plattform> av partienes organer
2. Antallet<subj> mennesker<xubj> som døde<menneske> i etterkant, var kraftig underrapportert<antall>.
We could use commas as a signal to switch to a previous subject, which would fix the above,
but commas after "non-optional" relatives should *not* lead to a switch:
3. Han<subj> opplyser at alle flyvinger<xubj> som var planlagt<flyvning> tirsdag, er innstilt<flyvning>.
4. Regjeringa<subj> stadfestar no at lisenskontoret<subj> i Mo i Rana, som har<lisenskontoret> 106 arbeidsplassar, legges<lisenskontoret> ned.
5. Tidligere i februar ble det<subj> kjent at seks politifolk<xubj> som jobbet for å avdekke narkotikakriminalitet, var pågrepet<folk> for selv å ha smuglet narkotika.
(Though in those cases, we have «at.cnjcoo.clb» on which to empty prev-subj.)
But it's hard to consistently tag these commas as clb; e.g. below we
have an "optional" relative clause where we do want to switch subjects
on the comma:
6. Den ene såkalte svarte boksen<subj> fra flyet<xubj> som styrtet<fly> i Etiopia, er funnet<boks>, melder etiopiske medier.
(Note: In example 1. we have xubj-subj-ref:subj-ref:xubj, whereas
in 6. we have subj-xubj-ref:xubj-ref:subj, i.e. we can't always
expect the relative subject to be the inner one.)
Possible idea: split @clb-tag into clause-with-subject vs clause-without-subject (ellipsis)?
- Uncommon, but referent may be far to the right:
- Desto lenger ut i kampen vi<pl> er, desto mer sliten<sg> blir muskulaturen<sg>.
- Mange har reist langveisfra, spesielt populær<mf> kan det virke som om Foyle<mf> er i Australia.
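
To make the comma heuristic from the crossing-subjects item concrete, here is a small Python sketch of "switch to the previous subject on a comma". It is only an illustration of the idea and its failure mode, not anything implemented in `refsyn.t1x`. The token kinds (`"subj"`, `"comma"`, `"pred"`) and the `resolve` function are hypothetical, and @subj and @xubj are treated alike.

```python
def resolve(tokens):
    """Pair each predicate with the subject the heuristic would pick.

    tokens: list of (kind, value), kind in {"subj", "comma", "pred"}.
    """
    prev = cur = None
    out = []
    for kind, value in tokens:
        if kind == "subj":
            # A new subject pushes the current one into `prev`.
            prev, cur = cur, value
        elif kind == "comma" and prev is not None:
            # The heuristic: a comma flips back to the previous subject.
            prev, cur = cur, prev
        elif kind == "pred":
            out.append((value, cur))
    return out


# Example 1 above: the comma switch gives the wanted referent.
ex1 = [("subj", "plattform"), ("subj", "parti"),
       ("comma", None), ("pred", "godkjent")]

# Example 3 above: the same switch picks "han" where "flyvning" is wanted.
ex3 = [("subj", "han"), ("subj", "flyvning"), ("pred", "planlagt"),
       ("comma", None), ("pred", "innstilt")]
```

Running `resolve(ex1)` pairs «godkjent» with «plattform» as desired, while `resolve(ex3)` pairs «innstilt» with «han», which is exactly the wrong switch after a "non-optional" relative.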