Package sentence :: Module dataset :: Class DataSet

Class DataSet


object --+
         |
        DataSet
Known Subclasses:

A wrapper around a list of ParallelSentence objects. It offers convenience functions for features and properties that apply to the entire set of parallel sentences as a whole.

Instance Methods

__eq__(self, other)
    @todo: comparison does not work correctly yet

__init__(self, content=[], attributes_list=[], annotations=[])
    x.__init__(...) initializes x; see help(type(x)) for signature

__iter__(self)
    A DataSet iterates over its basic wrapped objects, the ParallelSentence instances

_retrieve_attribute_names(self)

add_attribute_vector(self, att_vector, target='tgt', item=0)

append_dataset(self, add_dataset)
    Appends a given dataset to the end of the current dataset in place

clone(self)

compare(self, other_dataset, start=0, to=None)
    Compares this dataset to another by displaying parallel sentences in pairs

confirm_attributes(self, desired_attributes=[], meta_attributes=[])
    Convenience function that checks whether the user-requested attributes (possibly given via the configuration file) exist in the current dataset's attribute list

ensure_judgment_ids(self)
    Processes the contained parallel sentences one by one and ensures that each has a judgment id; where one is missing, an incremental value is assigned
 
get_all_attribute_names(self)

get_annotations(self)

get_attribute_names(self)

get_discrete_attribute_values(self, discrete_attribute_names)

get_head_sentences(self, n)

get_multisource_strings(self)

get_nested_attribute_names(self)

get_parallelsentences(self)

get_parallelsentences_per_sentence_id(self)
    Groups the contained parallel sentences by sentence id
    Returns: dict(String, list(sentence.parallelsentence.ParallelSentence))

get_parallelsentences_with_judgment_ids(self)
    Parallel sentences often come in multiple occurrences, where the judgment id is unique
    Returns: dict

get_singlesource_strings(self)

get_size(self)

get_tail_sentences(self, n)

get_target_strings(self)

get_translations_count_vector(self)

import_target_attributes_onsystem(self, dataset, target_attribute_names, keep_attributes_general=[], keep_attributes_source=[], keep_attributes_target=[])

merge_dataset(self, dataset_for_merging_with, attribute_replacements={}, merging_attributes=['id'], merge_strict=False, **kwargs)
    Takes a dataset which contains the same parallel sentences, but with different attributes

merge_dataset_symmetrical(self, dataset_for_merging_with, attribute_replacements={}, confirm_attribute='')
    Merges the current dataset in place with another symmetrical dataset of the same size and the same original content, but possibly with different attributes per parallel sentence

merge_references_symmetrical(self, dataset_for_merging_with)

modify_singlesource_strings(self, strings=[])

modify_target_strings(self, strings=[])

remove_ties(self)
    Modifies the current dataset by removing ranking ties

select_attribute_names(self, expressions=[])

split(self, ratio)

write_singlesource_strings_file(self, filename=None)

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Instance Variables

attribute_names ([str, ...])
    (optional) keeps track of the attributes that can be found in the contained parallel sentences
attribute_names_found (boolean)
    remembers whether the attribute names have been set
parallelsentences ([ParallelSentence, ...])
    a list of the contained parallel sentence instances
Properties

Inherited from object: __class__

Method Details

__init__(self, content=[], attributes_list=[], annotations=[])
(Constructor)


x.__init__(...) initializes x; see help(type(x)) for signature

Parameters:
  • content ([ParallelSentence, ...]) - the parallel sentences to be wrapped in the dataset
  • attributes_list ([str, ...]) - if the names of the attributes of the parallel sentences are known, they can be given here to avoid extra processing; otherwise they will be computed when needed
  • annotations (list) - not implemented
Overrides: object.__init__

append_dataset(self, add_dataset)


Appends a given data set to the end of the current dataset in place

Parameters:
  • add_dataset - dataset to be appended
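Since the append happens in place, nothing is returned. The following is a minimal sketch of these semantics using plain lists, with dicts standing in for ParallelSentence objects (the helper name is hypothetical, not part of the library):

```python
# Hypothetical sketch: append_dataset extends the wrapped sentence
# list of the current dataset in place; no new dataset is returned.
def append_dataset_sketch(current_sentences, added_sentences):
    current_sentences.extend(added_sentences)  # in-place modification

a = [{"id": "1"}, {"id": "2"}]
b = [{"id": "3"}]
append_dataset_sketch(a, b)
# a now contains all three sentences, in their original order
```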

confirm_attributes(self, desired_attributes=[], meta_attributes=[])


Convenience function that checks whether the user-requested attributes (possibly given via the configuration file) exist in the current dataset's attribute list. If not, an error is raised to warn the user of a possible typo or similar mistake.

Parameters:
  • desired_attributes - attributes that need to participate in the ML process
  • meta_attributes - attributes that need not participate in the ML process (meta)
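The check can be sketched as follows; `confirm_attributes_sketch` and the flat attribute list are hypothetical stand-ins for the actual implementation:

```python
# Hypothetical sketch: every requested attribute (desired or meta) must
# be among the dataset's known attribute names, otherwise an error
# flags a likely typo in the configuration.
def confirm_attributes_sketch(available, desired_attributes=(), meta_attributes=()):
    missing = (set(desired_attributes) | set(meta_attributes)) - set(available)
    if missing:
        raise ValueError("Unknown attributes requested: %s" % sorted(missing))

available = ["id", "rank", "langsrc"]
confirm_attributes_sketch(available, desired_attributes=["rank"])  # passes silently
```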

get_parallelsentences_per_sentence_id(self)


Group the contained parallel sentences by sentence id

Returns: dict(String, list(sentence.parallelsentence.ParallelSentence))
a dictionary with lists of parallel sentences for each sentence id
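The grouping can be illustrated with plain dicts standing in for ParallelSentence objects; this is a sketch of the described behaviour, not the library's code:

```python
from collections import defaultdict

# Hypothetical sketch: group sentences by their sentence id; each id
# maps to the list of parallel sentences that share it.
def group_by_sentence_id(sentences):
    groups = defaultdict(list)
    for sentence in sentences:
        groups[sentence["id"]].append(sentence)
    return dict(groups)

sentences = [{"id": "1", "system": "a"},
             {"id": "1", "system": "b"},
             {"id": "2", "system": "a"}]
groups = group_by_sentence_id(sentences)
# two sentences share id "1", one has id "2"
```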

get_parallelsentences_with_judgment_ids(self)


Parallel sentences often come in multiple occurrences, where the judgment id is unique. This function returns a dictionary of all the parallel sentences mapped to their respective judgment ids. If a judgment id is missing, the sentence is assigned an incremental value reflecting the order of the entry in the set.

Returns: dict
A dictionary of all the parallel sentences mapped to their respective judgment id.
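The fallback behaviour can be sketched like this (the helper name and dict-based sentences are illustrative, not the library's):

```python
# Hypothetical sketch: map each sentence to its judgment id; a
# sentence without one is keyed by its position in the set instead.
def map_by_judgment_id(sentences):
    mapping = {}
    for index, sentence in enumerate(sentences):
        mapping[sentence.get("judgment_id", index)] = sentence
    return mapping

sentences = [{"judgment_id": "j1", "rank": "2"},
             {"rank": "1"}]  # no judgment id: falls back to index 1
mapping = map_by_judgment_id(sentences)
```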

merge_dataset(self, dataset_for_merging_with, attribute_replacements={}, merging_attributes=['id'], merge_strict=False, **kwargs)


Takes a dataset that contains the same parallel sentences, but with different attributes. Incoming parallel sentences are matched with the existing parallel sentences based on the merging attribute(s). Incoming attributes can be renamed, so that they do not replace existing attributes.

Parameters:
  • dataset_for_merging_with (DataSet) - the data set whose contents are to be merged with the current data set
  • attribute_replacements (list of tuples) - the attribute renamings that need to take place on the incoming attributes before they are merged
  • merging_attributes (list of strings) - the names of the attributes that signify that two parallelsentences are the same, though with possibly different attributes
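The matching-and-renaming behaviour can be sketched with dicts standing in for parallel sentences. The helper below is an illustration of the described semantics under the assumption that existing attributes win on collision; it is not the actual implementation:

```python
# Hypothetical sketch: incoming sentences are matched to existing ones
# on the merging attribute ("id" here), and incoming attributes may
# first be renamed via attribute_replacements so they do not overwrite
# existing ones. Existing values are assumed to take precedence.
def merge_sketch(existing, incoming, merging_attribute="id",
                 attribute_replacements=()):
    incoming_by_id = {s[merging_attribute]: s for s in incoming}
    for sentence in existing:
        match = incoming_by_id.get(sentence[merging_attribute], {})
        for name, value in match.items():
            for old, new in attribute_replacements:
                if name == old:
                    name = new
            sentence.setdefault(name, value)  # existing attributes win

existing = [{"id": "1", "rank": "2"}]
incoming = [{"id": "1", "rank": "1"}]
merge_sketch(existing, incoming, attribute_replacements=[("rank", "rank_b")])
# existing[0] keeps "rank": "2" and gains "rank_b": "1"
```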

merge_dataset_symmetrical(self, dataset_for_merging_with, attribute_replacements={}, confirm_attribute='')


Merge the current dataset in place with another symmetrical dataset of the same size and the same original content, but possibly with different attributes per parallel sentence

Parameters:
  • dataset_for_merging_with (DataSet) - the symmetrical dataset with the same order of parallel sentences
  • attribute_replacements ({str: str, ...}) - a dict of the attribute replacements that need to take place before merging occurs
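Because both datasets are assumed to hold the same sentences in the same order, the attributes can simply be combined pairwise. A minimal sketch of these semantics, with hypothetical names and dicts standing in for parallel sentences:

```python
# Hypothetical sketch: merge two same-order datasets pairwise,
# renaming incoming attributes via attribute_replacements first;
# existing attributes are assumed to take precedence on collision.
def merge_symmetrical_sketch(current, other, attribute_replacements=None):
    replacements = attribute_replacements or {}
    for sentence, counterpart in zip(current, other):
        for name, value in counterpart.items():
            name = replacements.get(name, name)
            sentence.setdefault(name, value)

current = [{"id": "1", "score_a": "0.5"}]
other = [{"id": "1", "score": "0.7"}]
merge_symmetrical_sketch(current, other, {"score": "score_b"})
# current[0] keeps "score_a" and gains "score_b": "0.7"
```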