Parla - the most debated topics in the Brazilian House of Representatives


The “big data” and artificial intelligence revolution has opened up new ways of making parliamentary speeches and debates accessible. Because MPs express their opinions and defend their positions through words, the frequency of words and expressions in their speeches can show which themes were most debated in the Brazilian House of Representatives.

 

Parla was created for this purpose. It uses the speeches given in the Plenary to show which topics were most debated during the legislature. The objective is to provide a faithful picture of what representatives said in the House of all Brazilians. To ensure an accurate picture of what each MP says, two different methods were needed, the bag of words and Naive Bayes/Decision Tree, as there is no globally best method for automated content analysis (GRIMMER; STEWART, 2013).

 

The bag of words measures only the frequency of words in an MP’s speech. To preserve some of the word order in the speech, Parla shows only the most frequently used expressions of two to five words. As single words did not carry meaningful content, they were removed from the model, and each sequence of two to five words was treated as a “single word”. The objective is to create a Document-Term Matrix (DTM), in which each row represents a document and each column represents a single term: a bigram, a trigram, and so on up to a 5-gram. Each cell of the matrix holds the number of times that the term indicated in the column appears in the document indicated in the row; consequently, each document is represented by a single vector of term counts.
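
The matrix described above can be sketched in a few lines of Python. This is a minimal illustration, not Parla's actual code: the function names `ngrams` and `build_dtm`, the whitespace tokenization and the two example speeches are all assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n_min=2, n_max=5):
    """Return every contiguous expression of n_min..n_max words as one string."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_dtm(documents, n_min=2, n_max=5):
    """Build a document-term matrix: one row per document, one column per n-gram."""
    # Hypothetical tokenization: lowercase and split on whitespace.
    counts = [Counter(ngrams(doc.lower().split(), n_min, n_max)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Invented example speeches, limited to bigrams and trigrams to keep the output small.
speeches = [
    "political reform now political reform",
    "tax reform now",
]
vocabulary, dtm = build_dtm(speeches, n_max=3)

column = vocabulary.index("political reform")
print(dtm[0][column], dtm[1][column])  # count of "political reform" in each speech
```

Single words never enter the vocabulary because `n_min` starts at 2, which mirrors the decision to drop unigrams.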

Naive Bayes/Decision Tree is a supervised machine learning method. Supervised learning methods use the frequency with which words appear in a text to classify documents into predetermined categories. The algorithm “learns” how to assign documents to these categories from a training set of documents that have already been classified by hand. In other words, the algorithm uses document characteristics to classify them into categories (GRIMMER; STEWART, 2013).

 

Parla used 6,200 sentences, manually classified by the speech indexing staff, as a training set to learn to identify the most debated topic within a pre-established set of 31 themes.

 

Based on Bayes’ theorem, Naive Bayes is one of the most widely used supervised classification methods in the Political Science literature. It starts from a naive assumption: the model assumes that, within a given category, words are generated independently of one another, when in fact word usage is highly correlated in any given data set. Even so, the model provides a useful method for assigning documents to predetermined categories (GRIMMER; STEWART, 2013; IZUMI; MOREIRA, 2018).
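
To make the independence assumption concrete, the sketch below implements a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is only an illustration under stated assumptions: the class name `NaiveBayes`, the smoothing choice and the tiny training set are invented here, and the source does not describe Parla's implementation at this level of detail.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumption: add-one smoothing)

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
            # Log prior plus a sum of per-term log likelihoods: summing the
            # terms one by one is exactly the "naive" independence assumption.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocabulary:  # terms never seen in training are skipped
                    score += math.log((counts[t] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Invented training sentences with hypothetical theme labels.
train_docs = [
    ["hospital", "vaccine", "doctor"],
    ["vaccine", "hospital", "nurse"],
    ["school", "teacher", "student"],
    ["school", "university", "teacher"],
]
train_labels = ["Health", "Health", "Education", "Education"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["vaccine", "doctor"]))
print(model.predict(["teacher", "student"]))
```

Working with log probabilities avoids numerical underflow when a speech contains many terms.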

 

In the case of large collections, the Decision Tree classifier can be used together with Naive Bayes to increase accuracy (KOHAVI, 2011). 

 

The Decision Tree algorithm “asks questions” about the data until it has filtered the information enough to make a prediction. In the case of speeches, the branches of the tree are defined by the bags of words of all the training sentences and are organized so that, when the algorithm receives a new bag of words, it can filter it down and predict which theme it belongs to.
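
The “questions” can be pictured as tests of the form “does this bag of words contain term X?”. The sketch below builds a plain decision tree that picks, at each node, the term whose presence best separates the themes according to Gini impurity. It is a simplified stand-in, not the Naive Bayes/Decision Tree hybrid cited above, and the splitting criterion, function names and sample data are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 means a single theme)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(samples):
    """samples: list of (set_of_terms, label). Returns a label or a question node."""
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    best = None
    for term in sorted(set().union(*(terms for terms, _ in samples))):
        yes = [s for s in samples if term in s[0]]
        no = [s for s in samples if term not in s[0]]
        if not yes or not no:
            continue  # this question does not split the data
        score = (len(yes) * gini([l for _, l in yes])
                 + len(no) * gini([l for _, l in no])) / len(samples)
        if best is None or score < best[0]:
            best = (score, term, yes, no)
    if best is None:
        return majority  # identical bags with different labels: fall back to majority
    _, term, yes, no = best
    return {"term": term, "yes": build_tree(yes), "no": build_tree(no)}

def predict(tree, terms):
    """Follow the questions down the tree until a theme label is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["term"] in terms else tree["no"]
    return tree

# Invented bags of words with hypothetical theme labels.
samples = [
    ({"hospital", "vaccine"}, "Health"),
    ({"vaccine", "nurse"}, "Health"),
    ({"school", "teacher"}, "Education"),
    ({"teacher", "university"}, "Education"),
    ({"election", "ballot"}, "Election"),
    ({"ballot", "vote"}, "Election"),
]
tree = build_tree(samples)
print(predict(tree, {"vote", "ballot"}))
print(predict(tree, {"teacher", "salary"}))
```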

 

To apply Naive Bayes/Decision Tree, it was first necessary to define the themes and then teach the algorithm to classify speeches into them. The starting point was the House of Representatives’ table of classification and indexing of themes. The themes “Homages and Commemorative Dates” and “Legislative Process and Parliamentary Performance” were then removed, as they cover activities of parliamentary representation and do not contribute content to the thematic debate. Two other themes, “Public Administration” and “Politics, Parties and Elections”, were considered too broad; because they hold significant content, they were split rather than removed.

 

Thus, the topic “Public Administration” was separated into “Impeachment”, “Corruption” and “Public Service”, and the topic “Politics, Parties and Elections” was separated into “Political Reform” and “Election”. With this configuration, Parla reaches 40% accuracy in the macro-thematic classifier and, on average, 70% in the smaller classifiers.

 

For the two algorithms to be applied, the speeches first had to go through a series of pre-processing steps. To reduce the complexity and size of the vocabulary, and to focus on what is usual and meaningful in the text, Parla uses only words of medium frequency. It was therefore necessary to remove words that carry little content (those that appear in 90% of speeches) and words that are infrequent (those that appear in less than 1%). The most frequent words usually do not carry significant content and correspond to the closed vocabulary of the Portuguese language, such as conjunctions, prepositions, articles, pronouns and verbs. Words and expressions that are very common in the legislative process but do not carry significant content are also removed, such as those in procedural speeches by the President, the reading of the minutes and agenda, the election of the Board, prayers and tributes. These word lists are called stopword lists. In the case of Parla, the names of the states and the names of MPs were also removed.
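
The frequency thresholds and the stopword list can be combined in one filtering step. The sketch below is an illustration only: the function name `filter_vocabulary` is hypothetical, and it assumes that “too frequent” means a document frequency of 90% or more and “infrequent” means below 1%, since the text does not spell out how the boundaries are treated.

```python
from collections import Counter

def filter_vocabulary(documents, stopwords=frozenset(), min_df=0.01, max_df=0.90):
    """Keep only medium-frequency terms that are not stopwords.

    documents: list of token lists, one per speech.
    A term is kept when min_df <= (share of speeches containing it) < max_df.
    """
    n_docs = len(documents)
    # Document frequency: in how many speeches each term appears at least once.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    keep = {
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs < max_df and term not in stopwords
    }
    return [[t for t in tokens if t in keep] for tokens in documents]

# Ten invented speeches; "sr" and "presidente" appear in all of them.
speeches = (
    [["sr", "presidente", "reforma"]] * 5
    + [["sr", "presidente", "saude"]] * 4
    + [["sr", "presidente", "raro", "bahia"]]
)
# min_df is raised to 0.2 here only so that the rare-word rule is visible
# in a ten-speech example; the text above uses 1%.
filtered = filter_vocabulary(speeches, stopwords={"bahia"}, min_df=0.2)
print(filtered[0], filtered[5], filtered[9])
```

In this example the state name is dropped through the stopword list, while the ubiquitous and the rare terms are dropped through the frequency thresholds.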

 

After the stopwords are removed, the variability of the words is reduced through stemming. Stemming reduces a word to its stem by removing its ending, as in plurals or verb conjugations, so that related forms are reduced to a common basic form and grouped together.

 

Reducing a word to its stem will not necessarily yield its exact root. Obtaining the root requires a more complex algorithm, which identifies the origin of the word and returns only its lemma or root. As the Portuguese language is quite complex, applying all the rules and exceptions needed to reduce a term to its root would make the algorithm very slow. After stemming, the individual occurrences of each word are called tokens, and the speech content is finally ready to be converted into quantitative data.
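
The difference between a stem and an exact root is easy to see with a toy suffix stripper. The sketch below is deliberately crude: the suffix list and the `stem` function are invented for illustration, cover only a handful of Portuguese endings, and are not the stemmer used by Parla, which the text does not name.

```python
# Hypothetical, very incomplete list of Portuguese endings (plurals, gender
# markers, infinitives, gerunds). A real stemmer uses far richer rules.
SUFFIXES = ("mente", "ando", "endo", "indo",
            "ar", "er", "ir", "os", "as", "es",
            "s", "o", "a", "e")

def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["reformas", "reformar", "reformando", "deputados", "deputada", "reformista"]:
    print(w, "->", stem(w))
```

The inflected forms of the same word collapse to one stem, which is what the grouping step needs. “reformista”, however, is reduced to “reformist” rather than to the root it shares with “reforma”, which shows why stemming does not always return the exact root.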

 

To make the tool more useful, Parla provides filters that separate speeches by party, by state or by the gender of the MP. It is thus possible to see what only female MPs say, what the MPs of one state say, or to compare what MPs from different parties say. Another option is to see what MPs talk about spontaneously, without interference from the parliamentary agenda or from party control. For this, Parla offers a filter that captures only the speeches of deliberative and non-deliberative sessions in which each MP is free to debate their own agenda.
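
Filters of this kind amount to selecting speeches by their metadata before counting expressions. The sketch below shows one possible shape for that step; the `Speech` record, its field names and the sample data are hypothetical and are not taken from Parla.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Speech:
    # Hypothetical metadata fields; the real data model is not described in the text.
    speaker: str
    party: str
    state: str
    gender: str
    text: str

def filter_speeches(speeches, **criteria):
    """Return the speeches whose fields match every given criterion."""
    return [
        s for s in speeches
        if all(getattr(s, field) == value for field, value in criteria.items())
    ]

# Invented records with placeholder party names.
speeches = [
    Speech("MP 1", "Party A", "SP", "F", "political reform now"),
    Speech("MP 2", "Party B", "BA", "M", "tax reform now"),
    Speech("MP 3", "Party A", "BA", "F", "public health funding"),
]

women = filter_speeches(speeches, gender="F")
party_a_bahia = filter_speeches(speeches, party="Party A", state="BA")
print([s.speaker for s in women])
print([s.speaker for s in party_a_bahia])
```

Criteria can be combined freely, so comparing two parties is just two calls with different `party` values.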



Bibliographical References [ABNT 6023/2018]

GRIMMER, J.; STEWART, B. M. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, Oxford, UK, v. 21, n. 3, p. 267-297, 2013.

 

KOHAVI, R. Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. Data Mining and Visualization Silicon Graphics, Inc. Mountain View, CA, USA, 2011. 


MOREIRA, D.; IZUMI, M. O Texto como Dado: Desafios e Oportunidades para as Ciências Sociais. Revista Brasileira de Informação Bibliográfica em Ciências Sociais – BIB. São Paulo, BR, n. 86, 2018. 
