Package sentence :: Module dataset :: Class DataSet

Class DataSet


object --+
         |
        DataSet
Known Subclasses:

A wrapper around a list of ParallelSentence objects. It offers convenience functions for features and properties that apply to the entire set of parallel sentences as a whole.

Instance Methods

__eq__(self, other)
    @todo: comparison does not work correctly yet

__init__(self, content=[], attributes_list=[], annotations=[])
    x.__init__(...) initializes x; see help(type(x)) for signature

__iter__(self)
    A DataSet iterates over its basic wrapped objects, the ParallelSentence instances

_retrieve_attribute_names(self)

add_attribute_vector(self, att_vector, target='tgt', item=0)

append_dataset(self, add_dataset)
    Appends a given dataset to the end of the current dataset in place

clone(self)

compare(self, other_dataset, start=0, to=None)
    Compares this dataset to another by displaying parallel sentences in pairs

confirm_attributes(self, desired_attributes=[], meta_attributes=[])
    Convenience function that checks whether the user-requested attributes (possibly given via the configuration file) exist in the current dataset's attribute list

ensure_judgment_ids(self)
    Processes the contained parallel sentences one by one and ensures that each has a judgment id; where one is missing, an incremental value is assigned
 
get_all_attribute_names(self)

get_annotations(self)

get_attribute_names(self)

get_discrete_attribute_values(self, discrete_attribute_names)

get_head_sentences(self, n)

get_multisource_strings(self)

get_nested_attribute_names(self)

get_parallelsentences(self)

get_parallelsentences_per_sentence_id(self)
    Groups the contained parallel sentences by sentence id
    Returns: dict(String, list(sentence.parallelsentence.ParallelSentence))

get_parallelsentences_with_judgment_ids(self)
    Parallel sentences often come in multiple occurrences, where the judgment id is unique
    Returns: dict

get_singlesource_strings(self)

get_size(self)

get_tail_sentences(self, n)

get_target_strings(self)

get_translations_count_vector(self)

import_target_attributes_onsystem(self, dataset, target_attribute_names, keep_attributes_general=[], keep_attributes_source=[], keep_attributes_target=[])

merge_dataset(self, dataset_for_merging_with, attribute_replacements={}, merging_attributes=['id'], merge_strict=False, **kwargs)
    Takes a dataset which contains the same parallel sentences, but with different attributes

merge_dataset_symmetrical(self, dataset_for_merging_with, attribute_replacements={}, confirm_attribute='')
    Merges the current dataset in place with another symmetrical dataset of the same size and the same original content, but possibly with different attributes per parallel sentence

merge_references_symmetrical(self, dataset_for_merging_with)

modify_singlesource_strings(self, strings=[])

modify_target_strings(self, strings=[])

remove_ties(self)
    Modifies the current dataset by removing ranking ties

select_attribute_names(self, expressions=[])

split(self, ratio)

write_singlesource_strings_file(self, filename=None)

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Instance Variables

attribute_names ([str, ...])
    (optional) keeps track of the attributes that can be found in the contained parallel sentences
attribute_names_found (boolean)
    remembers whether the attribute names have been set
parallelsentences ([ParallelSentence, ...])
    a list of the contained parallel sentence instances
Properties

Inherited from object: __class__

Method Details

__init__(self, content=[], attributes_list=[], annotations=[])
(Constructor)


x.__init__(...) initializes x; see help(type(x)) for signature

Parameters:
  • content ([ParallelSentence, ...]) - the parallel sentences to be wrapped in the dataset
  • attributes_list ([str, ...]) - if the names of the attributes of the parallel sentences are known, they can be given here to avoid extra processing; otherwise they will be computed when needed
  • annotations (list) - not implemented
Overrides: object.__init__

append_dataset(self, add_dataset)


Appends a given data set to the end of the current dataset in place

Parameters:
  • add_dataset - dataset to be appended
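Since the append happens in place, nothing is returned. The following is a minimal sketch of these semantics using plain lists, with dicts standing in for ParallelSentence objects (the helper name is hypothetical, not part of the library):

```python
# Hypothetical sketch: append_dataset extends the wrapped sentence
# list of the current dataset in place; no new dataset is returned.
def append_dataset_sketch(current_sentences, added_sentences):
    current_sentences.extend(added_sentences)  # in-place modification

a = [{"id": "1"}, {"id": "2"}]
b = [{"id": "3"}]
append_dataset_sketch(a, b)
# a now contains all three sentences, in their original order
```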

confirm_attributes(self, desired_attributes=[], meta_attributes=[])


Convenience function that checks whether the user-requested attributes (possibly given via the configuration file) exist in the current dataset's attribute list. If not, an error is raised to warn the user of a possible typo or similar mistake.

Parameters:
  • desired_attributes - attributes that need to participate in the ML process
  • meta_attributes - attributes that need not participate in the ML process (meta)
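The check can be sketched as follows; `confirm_attributes_sketch` and the flat attribute list are hypothetical stand-ins for the actual implementation:

```python
# Hypothetical sketch: every requested attribute (desired or meta) must
# be among the dataset's known attribute names, otherwise an error
# flags a likely typo in the configuration.
def confirm_attributes_sketch(available, desired_attributes=(), meta_attributes=()):
    missing = (set(desired_attributes) | set(meta_attributes)) - set(available)
    if missing:
        raise ValueError("Unknown attributes requested: %s" % sorted(missing))

available = ["id", "rank", "langsrc"]
confirm_attributes_sketch(available, desired_attributes=["rank"])  # passes silently
```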

get_parallelsentences_per_sentence_id(self)


Group the contained parallel sentences by sentence id

Returns: dict(String, list(sentence.parallelsentence.ParallelSentence))
a dictionary with lists of parallel sentences for each sentence id
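The grouping can be illustrated with plain dicts standing in for ParallelSentence objects; this is a sketch of the described behaviour, not the library's code:

```python
from collections import defaultdict

# Hypothetical sketch: group sentences by their sentence id; each id
# maps to the list of parallel sentences that share it.
def group_by_sentence_id(sentences):
    groups = defaultdict(list)
    for sentence in sentences:
        groups[sentence["id"]].append(sentence)
    return dict(groups)

sentences = [{"id": "1", "system": "a"},
             {"id": "1", "system": "b"},
             {"id": "2", "system": "a"}]
groups = group_by_sentence_id(sentences)
# two sentences share id "1", one has id "2"
```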

get_parallelsentences_with_judgment_ids(self)


Parallel sentences often come in multiple occurrences, where the judgment id is unique. This function returns a dictionary of all the parallel sentences mapped to their respective judgment ids. If a judgment id is missing, the sentence is assigned an incremental value reflecting the order of the entry in the set.

Returns: dict
A dictionary of all the parallel sentences mapped to their respective judgment id.
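The fallback behaviour can be sketched like this (the helper name and dict-based sentences are illustrative, not the library's):

```python
# Hypothetical sketch: map each sentence to its judgment id; a
# sentence without one is keyed by its position in the set instead.
def map_by_judgment_id(sentences):
    mapping = {}
    for index, sentence in enumerate(sentences):
        mapping[sentence.get("judgment_id", index)] = sentence
    return mapping

sentences = [{"judgment_id": "j1", "rank": "2"},
             {"rank": "1"}]  # no judgment id: falls back to index 1
mapping = map_by_judgment_id(sentences)
```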

merge_dataset(self, dataset_for_merging_with, attribute_replacements={}, merging_attributes=['id'], merge_strict=False, **kwargs)


Takes a dataset that contains the same parallel sentences, but with different attributes. Incoming parallel sentences are matched with the existing parallel sentences based on the merging attribute(s). Incoming attributes can be renamed, so that they do not replace existing attributes.

Parameters:
  • dataset_for_merging_with (DataSet) - the data set whose contents are to be merged with the current data set
  • attribute_replacements (list of tuples) - the attribute renamings that need to take place on the incoming attributes before they are merged
  • merging_attributes (list of strings) - the names of the attributes that signify that two parallelsentences are the same, though with possibly different attributes
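The matching-and-renaming behaviour can be sketched with dicts standing in for parallel sentences. The helper below is an illustration of the described semantics under the assumption that existing attributes win on collision; it is not the actual implementation:

```python
# Hypothetical sketch: incoming sentences are matched to existing ones
# on the merging attribute ("id" here), and incoming attributes may
# first be renamed via attribute_replacements so they do not overwrite
# existing ones. Existing values are assumed to take precedence.
def merge_sketch(existing, incoming, merging_attribute="id",
                 attribute_replacements=()):
    incoming_by_id = {s[merging_attribute]: s for s in incoming}
    for sentence in existing:
        match = incoming_by_id.get(sentence[merging_attribute], {})
        for name, value in match.items():
            for old, new in attribute_replacements:
                if name == old:
                    name = new
            sentence.setdefault(name, value)  # existing attributes win

existing = [{"id": "1", "rank": "2"}]
incoming = [{"id": "1", "rank": "1"}]
merge_sketch(existing, incoming, attribute_replacements=[("rank", "rank_b")])
# existing[0] keeps "rank": "2" and gains "rank_b": "1"
```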

merge_dataset_symmetrical(self, dataset_for_merging_with, attribute_replacements={}, confirm_attribute='')


Merge the current dataset in place with another symmetrical dataset of the same size and the same original content, but possibly with different attributes per parallel sentence

Parameters:
  • dataset_for_merging_with (DataSet) - the symmetrical dataset with the same order of parallel sentences
  • attribute_replacements ({str: str, ...}) - a dict of the attribute replacements that need to take place before merging occurs
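Because both datasets are assumed to hold the same sentences in the same order, the attributes can simply be combined pairwise. A minimal sketch of these semantics, with hypothetical names and dicts standing in for parallel sentences:

```python
# Hypothetical sketch: merge two same-order datasets pairwise,
# renaming incoming attributes via attribute_replacements first;
# existing attributes are assumed to take precedence on collision.
def merge_symmetrical_sketch(current, other, attribute_replacements=None):
    replacements = attribute_replacements or {}
    for sentence, counterpart in zip(current, other):
        for name, value in counterpart.items():
            name = replacements.get(name, name)
            sentence.setdefault(name, value)

current = [{"id": "1", "score_a": "0.5"}]
other = [{"id": "1", "score": "0.7"}]
merge_symmetrical_sketch(current, other, {"score": "score_b"})
# current[0] keeps "score_a" and gains "score_b": "0.7"
```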